WO2017157198A1 - Attribute acquisition method and device - Google Patents

Attribute acquisition method and device

Info

Publication number
WO2017157198A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
target
word
candidate
platform
Prior art date
Application number
PCT/CN2017/075829
Other languages
French (fr)
Chinese (zh)
Inventor
陈强
吴夙慧
郭立超
李传福
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017157198A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present invention relates to information technology, and in particular, to an attribute acquisition method and apparatus.
  • a product library can be maintained for the published products; in the product library, attribute items such as brand, material, color, style and price range are determined according to the product category and used to describe each product, which facilitates statistics and user screening.
  • an original platform, such as Intime Commercial, may need to access a target platform, such as Taobao, and publish products on it.
  • the attributes used to describe a product on the original platform, including attribute items and attribute values, often differ from those on the target platform. For example, the Intime Commercial platform describes products in the dress category by brand, color, material and time to market, whereas the Taobao platform uses brand, color classification, style and price range. Therefore, before a product is published on the Taobao platform, it is necessary to determine the attribute value of each attribute item used when the Intime Commercial product is described on the Taobao platform, that is, to obtain the attributes of the product on the target platform.
  • in the prior art, the attributes of a product on the original platform may be clustered according to the attributes of the target platform to obtain the product's attributes on the target platform, but this approach can only process the product's attributes on the original platform; it cannot process unstructured text such as the product's title or detailed description on the original platform.
  • the present invention provides an attribute acquisition method and apparatus for obtaining the attributes of a product by processing unstructured text such as the product's title or detailed description on the original platform.
  • an attribute acquisition method, including: extracting, from the unstructured text used to describe the target object, a target word that matches a preset attribute; and determining an attribute of the target object according to the target word.
  • an attribute obtaining apparatus, including:
  • an extraction module configured to extract, from the unstructured text used to describe the target object, a target word that matches the preset attribute;
  • a determining module configured to determine an attribute of the target object according to the target word.
  • the attribute obtaining method and apparatus provided by the embodiments of the present invention extract target words that match the preset attributes of the target platform from the unstructured text used by the original platform to describe the target object, and then determine the attributes of the target object in the target platform according to the target words. For an e-commerce platform, the attributes of a product can be extracted from unstructured text such as the product's title and detailed description, thereby solving the technical problem that the prior art cannot process unstructured text to obtain the attributes, on the target platform, of products from the original platform.
  • FIG. 1 is a schematic flowchart diagram of an attribute obtaining method according to Embodiment 1;
  • FIG. 2 is a schematic diagram of an application scenario of an attribute acquisition method;
  • FIG. 3 is a schematic flowchart of an attribute obtaining method according to Embodiment 2 of the present invention.
  • FIG. 4 is a schematic structural diagram of an attribute obtaining apparatus according to Embodiment 3 of the present invention.
  • FIG. 5 is a schematic structural diagram of an attribute obtaining apparatus according to Embodiment 4 of the present invention.
  • FIG. 1 is a schematic flowchart of a method for obtaining an attribute according to the first embodiment.
  • the method provided in this embodiment may be used on an e-commerce platform; that is, the object mentioned in this embodiment may be a product, and the embodiment can be used to obtain the attributes of the product on the target platform before a product on the original platform is published to the target platform. As shown in FIG. 1, the method includes:
  • Step 101 Extract, from the unstructured text used to describe the target object, a target word that matches the preset attribute.
  • the preset attribute includes a preset attribute item and a preset attribute value.
  • each preset attribute value may consist of one or more words.
  • a correspondence between a preset attribute value and a plurality of preset attribute sub-values may also be set, where the preset attribute sub-values have semantics similar to the preset attribute value.
  • for example, words describing different clothing styles may be set as preset attribute values, and for each clothing style word, a plurality of words with similar semantics may be set as its preset attribute sub-values.
  • for example, nationality may be set as a preset attribute value, and words describing particular nationalities, such as Miao, Han and Vietnamese, may further be set as its preset attribute sub-values; likewise, if college is set as a preset attribute value, words describing the college style, such as campus, literary and "small fresh", are set as its preset attribute sub-values.
  • the words in the unstructured text are matched with the words corresponding to the preset attribute item, and if there is at least one matching word, the word is considered to match the preset attribute, and then the word is determined as the target word.
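The relationship between attribute items, attribute values and sub-values described above can be pictured as a small data structure. The following is a minimal sketch only: the class name `PresetAttribute`, the function `extract_target_words` and the sample vocabulary are hypothetical and are not taken from the source, and exact string equality stands in for the matching step.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class PresetAttribute:
    """One preset attribute: an item, a value, and optional sub-values (hypothetical layout)."""
    item: str                                   # attribute item, e.g. "style"
    value: str                                  # attribute value, e.g. "college"
    sub_values: List[str] = field(default_factory=list)

    def vocabulary(self) -> Set[str]:
        # Words that count as a match for this preset attribute.
        return {self.value, *self.sub_values}


# Hypothetical presets loosely based on the examples in the text.
PRESETS = [
    PresetAttribute("style", "national", ["Miao", "Han"]),
    PresetAttribute("style", "college", ["campus", "literary"]),
]


def extract_target_words(tokens: List[str],
                         presets: List[PresetAttribute]) -> List[Tuple[str, PresetAttribute]]:
    """Return (token, preset) pairs for every token found in a preset's vocabulary."""
    hits = []
    for token in tokens:
        for preset in presets:
            if token in preset.vocabulary():
                hits.append((token, preset))
    return hits


tokens = ["campus", "cotton", "dress"]
hits = extract_target_words(tokens, PRESETS)
for word, preset in hits:
    # A matched sub-value resolves to its attribute value and attribute item.
    print(word, "->", preset.item, "=", preset.value)
```

Because each sub-value points back to its attribute value and attribute item, a matched word such as "campus" resolves directly to an item/value pair.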
  • unstructured text such as the title and detailed description of the target object is obtained from the original platform and pre-processed.
  • the pre-processing operations mainly include word segmentation, full-width to half-width conversion, case unification, text normalization, accurate identification of brand words, and handling of single-character words. Then the preset attributes under the category to which the target object belongs are queried in the target platform.
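Part of the pre-processing listed above can be sketched with the Python standard library. This is an illustrative assumption, not the source's implementation: Unicode NFKC normalization is used here for full-width to half-width conversion, and whitespace splitting is only a stand-in for real word segmentation (Chinese text would need a dedicated segmenter). The function name `preprocess` is hypothetical.

```python
import unicodedata
from typing import List


def preprocess(text: str) -> List[str]:
    """Normalize width and case, then split into tokens (simplified sketch)."""
    # NFKC folds full-width letters, digits and the ideographic space
    # into their half-width (ASCII) counterparts.
    normalized = unicodedata.normalize("NFKC", text)
    # Unify case so that matching is case-insensitive.
    lowered = normalized.lower()
    # Whitespace split stands in for word segmentation here.
    return lowered.split()


print(preprocess("ＺＡＲＡ　Cotton　ＤＲＥＳＳ"))
```

Brand-word identification and single-character handling are omitted from this sketch because the text does not describe how they are performed.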
  • a similarity algorithm is used to perform string matching between the unstructured text and the preset attributes, obtaining the matched words as target words together with the matching degree between each target word and the preset attribute. String matching finds vocabulary in the unstructured text that is similar to the preset attributes.
  • the similarity algorithm used here may include: edit distance, cosine similarity, Euclidean distance, Jaccard similarity distance (Jaccard being a similarity measure also used for genetic similarity), a bigram (2-gram) language model, longest common subsequence, longest continuous common substring, and the like.
  • the target words may also be extracted from the unstructured text in other manners, such as semantic matching.
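Two of the measures named above, edit distance and Jaccard similarity over character bigrams, can be written in a few lines. This is a minimal sketch; the function names and the normalization of edit distance into a 0 to 1 score are assumptions made for illustration, since the text does not say how the matching degree is scaled.

```python
from typing import Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def bigrams(s: str) -> Set[str]:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_bigram(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' character-bigram sets."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Strings too short to have bigrams: fall back to equality.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


print(edit_distance("kitten", "sitting"))
print(round(edit_similarity("color", "colour"), 3))
print(round(jaccard_bigram("night", "nacht"), 3))
```

Either score could serve as the matching degree between a word in the text and a preset attribute value; which measure the source actually favors is not stated.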
  • the category mentioned above refers to the category to which the object belongs.
  • the granularity of the categories can be set by the user. For example, categories can be divided coarsely into clothing, shoes, hats, electronic products and so on, and can be refined further; clothing, for instance, can be divided into finer-grained categories such as shirts, dresses and pants. The finer the category granularity, the more accurate the obtained attributes, but the more preset attributes need to be maintained.
  • the category granularity can be chosen with reference to how much the preset attributes of two different categories differ. The categories should be divided so that the preset attributes of any two categories differ to some extent, so that a preset attribute set of appropriate size is maintained while the accuracy of the obtained attributes is ensured.
  • Step 102 Determine an attribute of the target object according to the target word.
  • the attribute of the target object is determined from the target word according to the matching degree of the target word and the preset attribute.
  • the attribute of the target object may be determined from the target words according to the matching degree between each target word and the preset attribute, where the matching degree is obtained by matching the target word against the preset attribute values and/or preset attribute sub-values of the preset attribute.
  • two similarity thresholds, a first threshold and a second threshold, are set in advance, wherein the first threshold is greater than the second threshold.
  • the matching degree is between 0 and 1.
  • the matching degree obtained in the previous step is compared with the first threshold and the second threshold.
  • a target word whose matching degree is higher than the first threshold is determined as an attribute of the target object in the target platform.
  • a target word whose matching degree is less than the first threshold but greater than the second threshold is considered a possible attribute of the target object; such target words are used as candidate attributes and require further judgment. In this embodiment, semantic discrimination is used to determine whether a candidate attribute is an attribute in the target platform, and the attributes of the target object in the target platform are determined from the candidate attributes according to the discrimination result.
  • a target word whose matching degree is below the second threshold is considered unlikely to be an attribute of the target object and is discarded directly.
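The two-threshold decision described above amounts to a three-way split of the target words. The sketch below assumes hypothetical threshold values of 0.8 and 0.5 and a hypothetical function name `triage`; the text only requires that the first threshold be greater than the second and that matching degrees lie between 0 and 1. How a matching degree exactly equal to a threshold is handled is not specified in the text, so the boundary behavior here is an assumption.

```python
from typing import List, Tuple


def triage(matches: List[Tuple[str, float]],
           first_threshold: float = 0.8,
           second_threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split (word, matching degree) pairs into attributes and candidate attributes.

    Words at or below the second threshold are discarded.
    """
    if not 0.0 <= second_threshold < first_threshold <= 1.0:
        raise ValueError("require 0 <= second_threshold < first_threshold <= 1")
    attributes, candidates = [], []
    for word, degree in matches:
        if degree > first_threshold:
            attributes.append(word)       # accepted directly as an attribute
        elif degree > second_threshold:
            candidates.append(word)       # needs semantic discrimination
        # otherwise: discarded as unlikely to be an attribute
    return attributes, candidates


attrs, cands = triage([("cotton", 0.95), ("campus", 0.6), ("sale", 0.2)])
print(attrs, cands)
```

Raising the first threshold pushes more words into the candidate pool, which trades extra semantic checks for fewer false attributes.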
  • target words matching the preset attributes of the target platform are extracted, and the attributes of the target object in the target platform are then determined from the target words according to the matching degree between each target word and the preset attribute. This scheme can extract the attributes of a product from unstructured text such as the product's title and detailed description, thereby solving the problem that unstructured text cannot be processed in the prior art.
  • the analysis may be performed based on the semantics of the target word to obtain the attribute of the target object.
  • for example, the target word extracted from the words in the product's detailed description page may be "Miao traditional costumes". Analyzing the semantics of the target word determines that "Miao traditional costumes" describes the national style, so the national style can be used as an attribute of the product.
  • the semantic analysis here can be based on a variety of semantic relationships, such as similar semantics and generalized semantics. Similar semantics means that an attribute has a meaning similar to the target word. Generalized semantics means that the attribute and the target word stand in a broader/narrower (hypernym/hyponym) relationship.
  • according to the preset attribute sub-value matched by the target word, the preset attribute value corresponding to that sub-value is used as the attribute value of the product, and the preset attribute item corresponding to that preset attribute value is used as the attribute item of the product.
  • the attribute of the product in the target platform can be obtained through the description page of the product in the original platform.
  • FIG. 2 is a schematic diagram of an application scenario of the attribute acquisition method.
  • the left picture is a product page in the original platform, which contains the product title and product details. Target words are extracted from the product title and the product details, and the list of product attributes shown in the right picture is obtained according to the extracted target words; the product attribute list can be used for screening products.
  • the product attribute includes the item attribute item and the attribute value of the item, the first column is the attribute item of the item, and the second column is the attribute value of the item.
  • FIG. 3 is a schematic flowchart of an attribute obtaining method according to Embodiment 2 of the present invention. As shown in FIG. 3, the method includes:
  • Step 201 Based on the unstructured text used to describe the target product in the original platform, predict the category to which the target product belongs in the target platform.
  • a classification model may be constructed in advance; for example, the classification model may be a naive Bayes classification model.
  • from click data, the category corresponding to each search keyword is determined according to the category of the product clicked after the search, yielding a correspondence between keywords and categories.
  • each keyword is segmented into terms, and the keyword in the keyword-category correspondence is replaced by its terms, yielding a correspondence between terms and categories.
  • the correspondence between terms and categories is used as the training set to train the classification model, which completes the construction of the classification model.
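Training a naive Bayes classifier on term-category pairs, as outlined above, can be sketched in pure Python. This is a minimal multinomial naive Bayes with add-one smoothing; the class name `NaiveBayesCategoryModel`, the smoothing choice and the toy training set are assumptions for illustration and are not taken from the source, which names only the model family.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, List, Optional, Tuple


class NaiveBayesCategoryModel:
    """Multinomial naive Bayes over search terms (illustrative sketch)."""

    def __init__(self) -> None:
        self.category_counts = Counter()          # training samples per category
        self.term_counts = defaultdict(Counter)   # term frequencies per category
        self.vocabulary = set()

    def fit(self, samples: Iterable[Tuple[List[str], str]]) -> "NaiveBayesCategoryModel":
        for terms, category in samples:
            self.category_counts[category] += 1
            self.term_counts[category].update(terms)
            self.vocabulary.update(terms)
        return self

    def predict(self, terms: List[str]) -> Optional[str]:
        total = sum(self.category_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, count in self.category_counts.items():
            # Log prior plus add-one smoothed log likelihood of each term.
            score = math.log(count / total)
            denom = sum(self.term_counts[category].values()) + vocab_size
            for term in terms:
                score += math.log((self.term_counts[category][term] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical term-category pairs, standing in for segmented search
# keywords paired with the category of the clicked product.
training_set = [
    (["floral", "dress"], "dresses"),
    (["summer", "dress"], "dresses"),
    (["running", "shoes"], "shoes"),
    (["leather", "shoes"], "shoes"),
]
model = NaiveBayesCategoryModel().fit(training_set)
print(model.predict(["floral", "summer", "dress"]))
```

At prediction time, the retained terms of a product title would be passed to `predict` to obtain the product's category on the target platform.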
  • the unstructured text may be a title and/or a detail page.
  • the title of the target product in the third-party platform can be segmented to obtain the terms of the title, and each term is then tagged with its part of speech to obtain the part-of-speech information of each term.
  • using a word-dropping algorithm, the terms are processed according to their part-of-speech information, so that interference words in the target product title are discarded and only product words, modifiers, brand words, time and season words, promotional words and the like are retained. The retained terms are input into the trained classification model to obtain the category of the target product on the Taobao platform.
  • the accurate category of the target product in the target platform can be obtained by this prediction, so that target words are obtained by matching against the preset attributes under that category, which increases the likelihood that the obtained target words are attributes of the target product.
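The word-dropping step above can be pictured as a filter over part-of-speech tagged terms. The sketch below assumes the tagging has already been done by some upstream tool; the tag names in `KEPT_TAGS`, the function name and the sample title are hypothetical, since the text lists the kinds of words retained but not a tag set.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag names for the word types the text says are retained.
KEPT_TAGS = {"product", "modifier", "brand", "season", "promotion"}


def drop_interference_words(tagged_terms: Iterable[Tuple[str, str]],
                            kept_tags: Set[str] = KEPT_TAGS) -> List[str]:
    """Keep only terms whose part-of-speech tag is in kept_tags."""
    return [term for term, tag in tagged_terms if tag in kept_tags]


tagged_title = [
    ("zara", "brand"),
    ("new", "modifier"),
    ("the", "function"),       # interference word, dropped
    ("summer", "season"),
    ("dress", "product"),
    ("!!!", "punctuation"),    # interference word, dropped
]
retained = drop_interference_words(tagged_title)
print(retained)
```

The retained terms are what would be fed to the category classifier.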
  • Step 202 Extract a target word that matches the preset attribute under the predicted category from the unstructured text.
  • the pre-processed unstructured text is subjected to similarity calculation, and the target word matching the preset attribute is obtained, and the matching degree is obtained.
  • the matching degree can be written as sim1.
  • the matching degree is used to describe the degree of similarity between the target word and the preset attribute.
  • the preset attribute includes two parts, namely an attribute item and an attribute value. If the target word is similar to the attribute value of a preset attribute, the target word matches that preset attribute, and the target word can be combined with the attribute item of the matched attribute to form an attribute pair, denoted PV.
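Forming a PV pair from a target word can be sketched as a search over the preset attribute values for the closest match. In this sketch the standard-library `difflib.SequenceMatcher.ratio()` is used as a stand-in for the matching degree sim1, and the pair is built from the attribute item and the matched preset value; both choices, along with the `PRESET` table and the function name `best_pv_pair`, are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Hypothetical preset attributes for one category: item -> allowed values.
PRESET: Dict[str, List[str]] = {
    "color": ["red", "navy blue"],
    "material": ["cotton", "silk"],
}


def best_pv_pair(word: str,
                 preset: Dict[str, List[str]]) -> Optional[Tuple[Tuple[str, str], float]]:
    """Return ((item, value), sim1) for the preset value most similar to word."""
    best = None
    for item, values in preset.items():
        for value in values:
            # ratio() is in [0, 1]; used here in place of the source's sim1.
            sim1 = SequenceMatcher(None, word, value).ratio()
            if best is None or sim1 > best[1]:
                best = ((item, value), sim1)
    return best


pair, sim1 = best_pv_pair("coton", PRESET)
print(pair, round(sim1, 3))
```

The returned similarity can then be compared against the thresholds a and b to decide whether the pair is an attribute, a candidate attribute, or neither.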
  • Step 203 Determine, according to the matching degree of the target word, the attribute and the candidate attribute of the target object in the target platform from the target word.
  • a target word whose similarity sim1 is greater than the preset threshold a is taken as an attribute of the target object in the target platform; a target word whose similarity is smaller than the preset threshold a and larger than the preset threshold b is taken as a candidate attribute.
  • Step 204 For the target words determined as attributes, match them against the target-platform products stored in the database, and extract the attributes of the matched candidate products.
  • the database includes a product library and a commodity library. Compared with the commodity library, the product library does not include the merchant field; the remaining data may be identical. That is, each record in the product library corresponds to one product, and each record in the commodity library corresponds to one product offered by one merchant.
  • a query is performed in the product library to obtain the candidate products that match all the target words determined as attributes.
  • a query is likewise performed in the commodity library to obtain the candidate commodities that match all the target words determined as attributes.
  • the attributes of all the candidates obtained by the two queries are used as attributes of the target product, and the confidence of each attribute is calculated.
  • Step 205 Calculate a confidence level of each attribute of the candidate item.
  • the confidence level is used to indicate the accuracy of describing the target item in the target platform.
  • the confidence value of each attribute of the candidate product may be set directly to 100%, or may be calculated with the confidence calculation formula given below; the two give the same result.
  • the confidence calculation formula is as follows:
  • the attribute pairs formed by the target words are: P1V1 and P2V2
  • the PV pairs of the candidate products are:
  • P1V1, P2V2, P3V3, P7V7, and P8V8 are output as attributes of the target item.
  • the confidence levels of P1V1, P2V2, P3V3, P7V7, and P8V8 are calculated, which are 100%, 100%, 33.3%, 33.3%, and 33.3%, respectively.
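The formula and the candidate products' PV pairs are not present in the text above, so the following is only a sketch under a stated assumption: confidence is taken to be the fraction of candidate products that carry a given PV pair. The three candidate products below are hypothetical and were chosen so that this assumption yields the same figures as the example (100% for P1V1 and P2V2, 33.3% for P3V3, P7V7 and P8V8); they are not the source's data.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def attribute_confidence(candidate_products: Iterable[Set[str]]) -> Dict[str, float]:
    """Confidence of each PV pair = share of candidate products containing it (assumed formula)."""
    products = [set(product) for product in candidate_products]
    total = len(products)
    counts = Counter(pv for product in products for pv in product)
    return {pv: count / total for pv, count in counts.items()}


# Hypothetical candidate products, each matching both target-word pairs P1V1 and P2V2.
candidates = [
    {"P1V1", "P2V2", "P3V3"},
    {"P1V1", "P2V2", "P7V7"},
    {"P1V1", "P2V2", "P8V8"},
]
confidence = attribute_confidence(candidates)
for pv in sorted(confidence):
    print(pv, f"{confidence[pv]:.1%}")
```

Under this assumption, pairs formed from the target words always score 100%, because every candidate product was selected for matching them.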
  • Step 206 For a target word determined as a candidate attribute, use semantic discrimination to determine the confidence that the candidate attribute is an attribute in the target platform.
  • in a first manner, semantic discrimination is performed based on the relationship between the characters within the candidate attribute.
  • the preset attribute values in the target platform are split into individual characters in advance and used as training text.
  • the word2vec algorithm is used for model training; the characters of a target word determined as a candidate attribute are input into the trained discriminant model to obtain character vectors, and the character vectors are accumulated to obtain a word vector.
  • the cosine value of the word vector is used as the confidence sim2 that the candidate attribute is an attribute in the target platform.
  • in a second manner, semantic discrimination is performed based on the context of the target word in the unstructured text.
  • the title or detail page of each product in the target platform is used as a corpus, and its word segmentation result is used as the training text.
  • the word2vec algorithm is used for model training; the target word determined as a candidate attribute is input into the trained discriminant model to obtain a word vector, and the cosine value of the word vector is used as the confidence sim3 that the candidate attribute is an attribute in the target platform.
  • the confidence S that the candidate attribute is an attribute in the target platform is determined from the similarities sim2 and sim3 obtained by the two semantic discrimination manners, for example as a weighted sum or weighted average of sim2 and sim3.
  • after the confidence S is calculated, it may be corrected with reference to the candidate products of the previous step: the frequency with which each candidate attribute occurs among the attributes of the candidate products is counted, and the corrected confidence S is obtained.
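The vector arithmetic behind sim2, sim3 and S can be sketched without a trained model. The toy two-dimensional vectors below stand in for word2vec output and are hypothetical. The text does not say which vector the cosine is taken against, so comparing the accumulated candidate vector with the accumulated vector of a preset attribute value is an assumption, as are the function names and the default weight.

```python
import math
from typing import Dict, List, Sequence


def accumulate(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of character vectors, giving one vector for the whole word."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_confidence(sim2: float, sim3: float, weight: float = 0.5) -> float:
    """Weighted sum of the two semantic similarities (the weight is a free parameter)."""
    return weight * sim2 + (1.0 - weight) * sim3


# Toy character vectors standing in for a trained word2vec model.
CHAR_VECTORS: Dict[str, List[float]] = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}

candidate_vec = accumulate([CHAR_VECTORS[ch] for ch in "ab"])   # candidate attribute "ab"
reference_vec = accumulate([CHAR_VECTORS[ch] for ch in "c"])    # preset value "c" (assumed reference)
sim2 = cosine(candidate_vec, reference_vec)
S = combined_confidence(sim2, sim3=0.4, weight=0.75)
print(round(sim2, 3), round(S, 3))
```

With real embeddings, the lookups into `CHAR_VECTORS` would be replaced by queries to the trained character-level and word-level models.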
  • Step 207 Pool the target words determined as attributes and as candidate attributes together with the attributes of the candidate products, and determine the attributes of the target product from the pooled result according to the confidence.
  • the confidence threshold can be set according to the accuracy required of the obtained attributes. The higher the required accuracy, the higher the confidence threshold can be set; if the required accuracy is lower, a lower confidence threshold can be set.
  • the target word with a confidence greater than the confidence threshold is selected from the summary results as the attribute of the target commodity.
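The final selection in Step 207 reduces to filtering the pooled results by confidence. The sketch below assumes the pooled results are held as a mapping from attribute to confidence; the function name, the sample values and the threshold of 0.6 are hypothetical.

```python
from typing import Dict, List


def select_attributes(pooled: Dict[str, float],
                      confidence_threshold: float = 0.6) -> List[str]:
    """Return, in sorted order, the attributes whose confidence exceeds the threshold."""
    return sorted(pv for pv, conf in pooled.items() if conf > confidence_threshold)


# Hypothetical pooled results: direct attributes, candidate-product
# attributes and semantically scored candidate attributes.
pooled = {"P1V1": 1.0, "P3V3": 1 / 3, "style:college": 0.7}
print(select_attributes(pooled))
```

Raising `confidence_threshold` keeps fewer but more reliable attributes, which mirrors the accuracy trade-off described above.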
  • FIG. 4 is a schematic structural diagram of an attribute obtaining apparatus according to Embodiment 3 of the present invention. As shown in FIG. 4, the apparatus includes: an extracting module 31 and a determining module 32.
  • the extracting module 31 is configured to extract, from the unstructured text used to describe the target object, a target word that matches the preset attribute;
  • the extraction module 31 is specifically configured to perform a string matching on the unstructured text and the preset attribute by using a similarity algorithm to obtain a matching target word and a corresponding matching degree.
  • the determining module 32 is configured to determine an attribute of the target object according to the target word.
  • the determining module 32 is specifically configured to determine an attribute of the target object from the target words according to a matching degree between the target word and the preset attribute.
  • the determining module 32 is specifically configured to perform an analysis based on the semantics of the target word to obtain an attribute of the target object.
  • target words matching the preset attributes of the target platform are extracted from the unstructured text used by the original platform to describe the target object, and the attributes of the target object in the target platform are then determined according to the target words. For an e-commerce platform, this solution can extract the attributes of a product from unstructured text such as the product's title and detailed description, thereby solving the problem that the prior art cannot process unstructured text to obtain the attributes, on the target platform, of products from the original platform.
  • FIG. 5 is a schematic structural diagram of an attribute obtaining apparatus according to Embodiment 4 of the present invention. On the basis of the attribute obtaining apparatus provided in FIG. 4, the determining module 32 further includes: a first determining unit 321 and a second determining unit 322.
  • the first determining unit 321 is configured to determine a target word whose matching degree is higher than the first threshold as an attribute of the target object in the target platform.
  • the second determining unit 322 is configured to take a target word whose matching degree is higher than the second threshold but lower than the first threshold as a candidate attribute, determine by semantic discrimination whether the candidate attribute is an attribute in the target platform, and determine, according to the discrimination result, the attributes of the target object in the target platform from the candidate attributes.
  • the second determining unit 322 may include at least one of the first discriminating subunit 3221 and the second discriminating subunit 3222.
  • the second determining unit 322 in FIG. 5 includes a first discriminating subunit 3221 and a second discriminating subunit 3222.
  • the first discriminating subunit 3221 is configured to perform semantic discrimination based on the relationship between the characters within the candidate attribute, and obtain the confidence that the candidate attribute is an attribute in the target platform.
  • the first discriminating subunit 3221 is specifically configured to: input each character of the candidate attribute into a pre-trained inter-character semantic discriminant model to obtain character vectors, the inter-character semantic discriminant model having been trained with each character of the attributes in the target platform as training text; accumulate the character vectors to obtain a first word vector; and use the cosine value of the first word vector as the confidence that the candidate attribute is an attribute in the target platform.
  • the second discriminating subunit 3222 is configured to perform semantic discrimination based on the context of the candidate attribute in the unstructured text, and obtain the confidence that the candidate attribute is an attribute in the target platform.
  • the second discriminating subunit 3222 is specifically configured to: input each word of the unstructured text into a pre-trained inter-word semantic discriminant model to obtain a second word vector, the inter-word semantic discriminant model having been trained with each word of the unstructured text in the target platform as training text; and use the cosine value of the second word vector as the confidence that the candidate attribute is an attribute in the target platform.
  • the second determining unit 322 may further include: an attribute determining subunit 3223.
  • the attribute determining sub-unit 3223 is configured to determine, according to the confidence, an attribute of the target object in the target platform from the candidate attributes.
  • the determining module 32 further includes: a matching unit 323.
  • the matching unit 323 is configured to: match the target words whose matching degree is higher than the first threshold against the attributes of each object in the target platform stored in the database, to obtain the matched candidate objects; calculate, according to the frequency with which each attribute occurs among the attributes of all candidate objects, the probability that an attribute of a candidate object is an attribute of the target object in the target platform; and determine, according to the calculated probability, the attributes of the target object in the target platform from among the attributes of the candidate objects.
  • the attribute obtaining apparatus further includes: a category prediction module 33 and a preset attribute determining module 34.
  • the category prediction module 33 is configured to predict, according to the unstructured text, the category of the target object in the target platform.
  • the preset attribute determining module 34 is configured to use an attribute under the category in the target platform as the preset attribute.
  • the category prediction module 33 includes: a mining unit 331 and a modeling unit 332.
  • the mining unit 331 is configured to perform data mining using the trained classification model based on the unstructured text of the target object, and obtain the category of the target object in the target platform.
  • the modeling unit 332 is configured to: acquire the user's search keywords and the category of the object selected from the search results; perform word segmentation on the keywords to obtain search terms; generate a training set from the search terms and the category to which the selected object belongs; and train the classification model with the training set.
  • the aforementioned program can be stored in a computer readable storage medium.
  • the program when executed, performs the steps including the foregoing method embodiments; and the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

An attribute acquisition method and a device, the method comprising: extracting a target word matching a preset attribute of a target platform from an unstructured text for describing a target object in an original platform (101); and then determining the attribute of the target object in the target platform according to the target word (102). For e-commerce platforms, the attribute of goods can be extracted from such unstructured texts as the product title and detailed description in the original platform, thereby solving the technical problem that in the prior art, the unstructured text cannot be processed to obtain the attribute of goods in the original platform on the target platform.

Description

属性获取方法和装置Attribute acquisition method and device
本申请要求2016年03月17日递交的申请号为201610154037.9、发明名称为“属性获取方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。The present application claims priority to Chinese Patent Application Serial No. No. No. No. No. No. No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No No
Technical field
The present invention relates to information technology, and in particular to an attribute acquisition method and device.
Background
On an e-commerce platform, a commodity library can be maintained for published commodities. In the library, attribute items such as brand, material, color, style and price range are defined for each commodity category and used to describe the commodities, which facilitates statistics and user filtering. When an original platform such as Intime Commercial needs to connect to a target platform such as Taobao and publish commodities there, the attributes used to describe a commodity on the original platform, including attribute items and attribute values, often differ from those of the target platform. For example, on the Intime Commercial platform, commodities under the dress category are described by brand, color, material and launch date, whereas on the Taobao platform they are described by brand, color classification, style and price range. Therefore, before a commodity is published on the Taobao platform, the attribute value of each attribute item used to describe the Intime Commercial commodity on the Taobao platform must be determined; that is, the commodity's attributes on the target platform must be acquired.
In the prior art, the attributes of a commodity on the original platform can be clustered according to the attributes of the target platform, thereby obtaining the commodity's attributes on the target platform. However, this approach can only process the commodity's attributes on the original platform; it cannot process unstructured text such as the commodity's title or detail description on the original platform.
Summary of the invention
The present invention provides an attribute acquisition method and device for obtaining the attributes of a commodity by processing unstructured text, such as the commodity's title or detail description on the original platform.
To achieve the above object, embodiments of the present invention adopt the following technical solutions:
In a first aspect, an attribute acquisition method is provided, including:
extracting, from unstructured text used to describe a target object, a target word that matches a preset attribute;
determining an attribute of the target object according to the target word.
In a second aspect, an attribute acquisition device is provided, including:
an extraction module, configured to extract, from unstructured text used to describe a target object, a target word that matches a preset attribute;
a determining module, configured to determine an attribute of the target object according to the target word.
The attribute acquisition method and device provided by the embodiments of the present invention extract, from the unstructured text used to describe a target object on the original platform, target words that match the preset attributes of the target platform, and then determine the target object's attributes on the target platform according to the target words. On an e-commerce platform, a commodity's attributes can thus be extracted from unstructured text such as its title and detail description. This solves the technical problem in the prior art that unstructured text cannot be processed to obtain the attributes, on the target platform, of a commodity from the original platform.
The above is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 is a schematic flowchart of an attribute acquisition method provided in Embodiment 1;
FIG. 2 is a schematic diagram of an application scenario of the attribute acquisition method;
FIG. 3 is a schematic flowchart of an attribute acquisition method provided in Embodiment 2 of the present invention;
FIG. 4 is a schematic structural diagram of an attribute acquisition device provided in Embodiment 3 of the present invention;
FIG. 5 is a schematic structural diagram of an attribute acquisition device provided in Embodiment 4 of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
The attribute acquisition method and device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a schematic flowchart of an attribute acquisition method provided in Embodiment 1. The method provided in this embodiment can be used on an e-commerce platform; that is, the object referred to in this embodiment may be a commodity. The embodiment can be used to obtain a commodity's attributes on the target platform before the commodity is moved from the original platform to the target platform. As shown in FIG. 1, the method includes:
Step 101: Extract, from unstructured text used to describe a target object, a target word that matches a preset attribute.
A preset attribute comprises a preset attribute item and a preset attribute value. For a given preset attribute item, the corresponding preset attribute values may consist of one or more words. Optionally, after the correspondence between preset attribute items and preset attribute values has been set, a correspondence between each preset attribute value and multiple preset attribute sub-values may also be set, where a preset attribute sub-value has semantics similar to those of its preset attribute value.
For example, for the preset attribute item of clothing style, words describing different clothing styles can be set as preset attribute values. Further, for each clothing-style word, multiple words with similar semantics can be set as preset attribute sub-values. Specifically, "ethnic" can be set as a preset attribute value, and words naming specific ethnic groups, such as Miao, Han and Tibetan, can be set as its preset attribute sub-values. Likewise, "college" can be set as a preset attribute value, with words that describe the college style more specifically, such as "campus", "artsy" and "fresh", set as its preset attribute sub-values.
Note that matching here refers not only to exact matching but also to partial matching.
Specifically, the words in the unstructured text are matched against the words corresponding to the preset attribute item; if a word in the text matches at least one of them, the word is considered to match the preset attribute and is determined to be a target word. Before matching, unstructured text such as the title and detail description of the target object on the original platform can be obtained and preprocessed. Preprocessing mainly includes word segmentation, converting full-width characters to half-width, unifying letter case, normalizing the text, accurately recognizing brand words, and handling single characters. Then, the preset attributes under the category to which the target object belongs are looked up on the target platform. A similarity algorithm is used to perform string matching between the unstructured text and the preset attributes, yielding target words such as matched words, together with the matching degree between each target word and the preset attribute. String matching finds words in the unstructured text that are similar to the preset attributes. The similarity algorithms that can be used here include edit distance, cosine similarity, Euclidean distance, Jaccard similarity distance (the Jaccard coefficient is a similarity measure that is also used to quantify genetic similarity), the bigram (2-gram) language model, longest common subsequence, and longest common contiguous substring.
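The preprocessing and string-matching step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the text has already been segmented into words, picks edit distance from the listed algorithms, and uses hypothetical names (`normalize`, `matching_degree`, `extract_target_words`) and a hypothetical cut-off `min_degree`.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold full-width characters to half-width and unify letter case."""
    # NFKC maps full-width Latin letters and digits to their ASCII forms.
    return unicodedata.normalize("NFKC", text).lower()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def matching_degree(word: str, preset_value: str) -> float:
    """Similarity in [0, 1]; 1.0 means the normalized strings are identical."""
    a, b = normalize(word), normalize(preset_value)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def extract_target_words(words, preset_values, min_degree=0.5):
    """Return (word, best preset value, degree) for every word that matches.

    min_degree is an assumed cut-off for "partial matching"; the source
    does not give a value.
    """
    if not preset_values:
        return []
    hits = []
    for word in words:
        best = max(preset_values, key=lambda v: matching_degree(word, v))
        degree = matching_degree(word, best)
        if degree >= min_degree:
            hits.append((word, best, degree))
    return hits
```

Dividing the distance by the longer string's length keeps the matching degree between 0 and 1, which is the range the later threshold comparison expects.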
In this step, target words can be extracted from the unstructured text not only by the string matching described above but also in other ways, such as semantic matching.
Note that the category mentioned above is the class to which an object belongs. The granularity of categories can be set by the user: commodities can be divided coarsely into clothing, shoes and hats, electronics and so on, or subdivided further, for example dividing clothing into finer-grained shirts, dresses, trousers and so on. The finer the category granularity, the more accurate the acquired attributes, but the more preset attributes need to be maintained. The granularity can be chosen with reference to how much the preset attributes of two different categories differ: categories should be divided so that the preset attributes of any two categories differ to some degree, so that a preset attribute set of appropriate size is maintained while the accuracy of the acquired attributes is preserved.
Step 102: Determine an attribute of the target object according to the target word.
In one possible implementation, the attribute of the target object is determined from the target words according to the matching degree between each target word and the preset attribute.
The target word can be matched against the preset attribute values and/or preset attribute sub-values of the preset attribute, and the attribute of the target object is then determined from the target words according to the matching degree. Specifically, two similarity thresholds are set in advance, a first threshold and a second threshold, where the first threshold is greater than the second. A target word whose matching degree is higher than the first threshold is determined to be an attribute of the target object on the target platform. A target word whose matching degree is higher than the second threshold but lower than the first threshold is taken as a candidate attribute; semantic discrimination is used to determine whether the candidate attribute is an attribute on the target platform, and the attribute of the target object on the target platform is determined from the candidate attributes according to the discrimination result.
In general, the matching degree takes a value between 0 and 1. Comparing the matching degree obtained in the previous step with the first and second thresholds gives three cases:
In the first case, a target word whose matching degree is greater than the first threshold is considered highly likely to be an attribute of the target object.
In the second case, a target word whose matching degree is less than the first threshold but greater than the second threshold is considered possibly an attribute of the target object. Such target words are taken as candidate attributes and require further judgment, which in this embodiment is performed by semantic discrimination.
In the third case, a target word whose matching degree is less than the second threshold is considered very unlikely to be an attribute of the target object and is discarded directly.
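The three cases above amount to a two-threshold triage, which might be sketched as below. The function name `triage` and the default threshold values are assumptions for illustration. The source does not say how a matching degree exactly equal to a threshold is handled; this sketch treats equality with the first threshold as a candidate and equality with the second as discarded.

```python
def triage(scored_words, first_threshold=0.8, second_threshold=0.5):
    """Split (word, matching_degree) pairs into attributes and candidates.

    The default thresholds are illustrative only; the source requires
    nothing beyond first_threshold > second_threshold.
    """
    if not 0 < second_threshold < first_threshold < 1:
        raise ValueError("expected 0 < second_threshold < first_threshold < 1")
    attributes, candidates = [], []
    for word, degree in scored_words:
        if degree > first_threshold:
            attributes.append(word)      # case 1: accepted as an attribute
        elif degree > second_threshold:
            candidates.append(word)      # case 2: needs semantic discrimination
        # case 3: anything else is discarded
    return attributes, candidates
```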
As can be seen, target words that match the preset attributes of the target platform are extracted from the unstructured text used to describe the target object on the original platform, and the target object's attributes on the target platform are then determined from the target words according to their matching degree with the preset attributes. A commodity's attributes can thus be extracted from unstructured text such as its title and detail description. This solves the technical problem in the prior art that unstructured text cannot be processed to obtain the attributes, on the target platform, of a commodity from the original platform.
In another possible implementation, the attribute of the target object can be obtained by analyzing the semantics of the target word. For example, a target word extracted from the detail description page of a commodity may be "Miao traditional costume". Analysis of its semantics determines that "Miao traditional costume" describes an ethnic style, so ethnic style can be taken as an attribute of the commodity. The semantic analysis here can be based on several semantic relations, such as similar semantics and generalizing semantics: similar semantics means that the attribute and the target word have similar meanings, and generalizing semantics means that the attribute and the target word stand in a hypernym-hyponym relation.
Because a preset attribute value and its preset attribute sub-values are semantically related, the preset attribute sub-value matched by a target word can be used to look up the corresponding preset attribute value. That preset attribute value is taken as the commodity's attribute value, and the preset attribute item corresponding to it is taken as the commodity's attribute item.
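The lookup from a matched sub-value back to its attribute value and attribute item can be sketched with a reverse index. The nested-dictionary layout and the names `PRESET_ATTRIBUTES`, `build_reverse_index` and `resolve` are assumptions made for this illustration; the sample data reuses the clothing-style example given earlier.

```python
# Hypothetical layout: attribute item -> attribute value -> sub-values.
PRESET_ATTRIBUTES = {
    "style": {
        "ethnic": ["Miao", "Han", "Tibetan"],
        "college": ["campus", "artsy", "fresh"],
    },
}


def build_reverse_index(preset):
    """Map every value and sub-value to its (attribute item, attribute value)."""
    index = {}
    for item, values in preset.items():
        for value, sub_values in values.items():
            index[value] = (item, value)
            for sub_value in sub_values:
                index[sub_value] = (item, value)
    return index


def resolve(matched_word, index):
    """Return (attribute item, attribute value) for a matched word, or None."""
    return index.get(matched_word)
```

With this index, a target word that matched the sub-value "Miao" resolves to the attribute item "style" with attribute value "ethnic".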
Note that in practice other ways of analyzing the semantics of the target word can also be used to obtain the attribute of the target object, for example a data-mining classifier trained on the semantics of words.
With the attribute acquisition method described above, a commodity's attributes on the target platform can be obtained from the commodity's description page on the original platform. FIG. 2 is a schematic diagram of an application scenario of the attribute acquisition method. As shown in FIG. 2, the left side is a commodity page on the original platform, which contains the commodity title and commodity details. Target words are extracted from the title and details, and the commodity attribute list shown on the right is obtained from the extracted target words; this list can be used for filtering commodities. The commodity attributes comprise attribute items and attribute values: the first column holds the attribute items and the second column holds the attribute values.
Embodiment 2
This embodiment describes in detail, for an e-commerce scenario in which an original platform connects to a target platform, how to obtain the attributes on the target platform of a commodity from the original platform. FIG. 3 is a schematic flowchart of an attribute acquisition method provided in Embodiment 2 of the present invention. As shown in FIG. 3, the method includes:
Step 201: Based on the unstructured text used to describe the target commodity on the original platform, predict the category to which the target commodity belongs on the target platform.
Specifically, a classification model can first be built in advance; for example, the classification model may be a naive Bayes classifier. The keywords users search for and the click data following each search are collected. From the categories of the commodities clicked after a search, the category corresponding to each keyword is determined, giving a keyword-to-category correspondence. The keywords are then segmented into terms, and the terms replace the keywords in the keyword-to-category correspondence, giving a term-to-category correspondence. This correspondence is used as the training set to train the classification model, which completes its construction.
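As a rough illustration of the training step, the sketch below fits a small multinomial naive Bayes model on term-to-category pairs. It is written in plain Python for self-containment; the class name, the Laplace smoothing parameter `alpha` and the toy training pairs are all assumptions, and a production system would more likely use an established library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategoryModel:
    """Multinomial naive Bayes over terms, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.term_counts = defaultdict(Counter)  # category -> term frequencies
        self.category_counts = Counter()         # category -> number of samples
        self.vocabulary = set()

    def fit(self, pairs):
        """pairs: iterable of (terms, category) from segmented search keywords."""
        for terms, category in pairs:
            self.category_counts[category] += 1
            self.term_counts[category].update(terms)
            self.vocabulary.update(terms)
        return self

    def predict(self, terms):
        """Return the most probable category, or None if the model is empty."""
        total = sum(self.category_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for category, n in self.category_counts.items():
            counts = self.term_counts[category]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)  # log prior
            for term in terms:
                score += math.log((counts[term] + self.alpha) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Toy term-to-category pairs standing in for the click-derived training set.
TRAINING_PAIRS = [
    (["summer", "dress"], "dresses"),
    (["lace", "dress"], "dresses"),
    (["cotton", "shirt"], "shirts"),
    (["striped", "shirt"], "shirts"),
]
```

Calling `NaiveBayesCategoryModel().fit(TRAINING_PAIRS).predict(terms)` with the retained terms of a title then yields the predicted category.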
Then, based on the unstructured text of the target object, the trained classification model is used for data mining to obtain the category to which the target object belongs on the target platform. The unstructured text may be the title and/or the detail page description.
For example, when a third-party platform such as Intime, acting as the original platform, needs to connect to Taobao as the target platform, the title of the target commodity on the third-party platform can be segmented into terms, and the terms are then tagged with parts of speech to obtain part-of-speech information for each term. A word-dropping algorithm then discards terms according to their part-of-speech information, so that noise words in the title are dropped and only product words, modifiers, brand words, time and season words, promotional words and the like are kept. The retained terms are fed into the trained classification model to obtain the target commodity's category on the Taobao platform.
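The word-dropping step can be pictured as a filter over tagged terms. The sketch assumes segmentation and tagging have already produced `(term, tag)` pairs; the tag names in `KEPT_TAGS` are hypothetical labels for the word classes the text says are retained, not the output of any particular tagger.

```python
# Hypothetical tag names for the word classes that are kept.
KEPT_TAGS = {"product", "modifier", "brand", "season", "promotion"}


def drop_words(tagged_terms, kept_tags=KEPT_TAGS):
    """Keep only terms whose tag is in kept_tags, preserving title order."""
    return [term for term, tag in tagged_terms if tag in kept_tags]
```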
Because categories are often divided differently on different platforms, prediction can be used to obtain the exact category of the target commodity on the target platform. This makes it easier to obtain target words by matching against that category's preset attributes, and raises the likelihood that the obtained target words contain attributes of the target commodity.
Step 202: Extract, from the unstructured text, target words that match the preset attributes under the predicted category.
Specifically, similarity is computed on the preprocessed unstructured text to obtain the target words that match the preset attributes, together with their matching degrees. For ease of description, the matching degree is denoted sim1. The matching degree describes how similar a target word is to a preset attribute.
A preset attribute has two parts, an attribute item and an attribute value. If a target word is similar to the attribute value of a preset attribute, the target word is said to match that preset attribute, and the target word can be combined with the attribute item of the matched attribute to form an attribute pair, denoted PV.
Step 203: According to the matching degrees of the target words, determine from the target words the attributes and candidate attributes of the target object on the target platform.
For example, a target word whose similarity sim1 is greater than a preset threshold a is taken as an attribute of the target object on the target platform; a target word whose similarity is less than the preset threshold a and greater than a preset threshold b is taken as a candidate attribute, where 0 < b < a < 1.
Step 204: For the target words determined to be attributes, match them against the target platform's commodities stored in a database, and extract the attributes of the matched candidate commodities.
Specifically, the database includes a product library and a commodity library. Compared with the commodity library, the product library lacks the merchant field; the remaining data may be identical. That is, each record in the product library corresponds to one product, and each record in the commodity library corresponds to one product offered by one merchant.
First, the product library is queried to obtain the candidate commodities in the product library that match all of the target words determined to be attributes.
Then, the commodity library is queried to obtain the candidate commodities in the commodity library that match all of the target words determined to be attributes.
The attributes of all candidate commodities obtained from the two queries are taken as attributes of the target commodity, and the confidence of each attribute is then calculated.
Step 205: Calculate the confidence of each attribute of the candidate commodities.
The confidence indicates how accurately the attribute describes the target commodity on the target platform.
If the target words determined to be attributes include the brand and the model, and there is exactly one candidate commodity, the confidence of each attribute of the candidate commodity can be set directly to 100%, or it can be computed with the confidence formula below; the result is the same. The confidence formula is as follows:
Confidence = (number of occurrences among the attributes of the candidate commodities / total number of candidate commodities) × 100%
For example:
The attribute pairs formed by the target words are: P1V1 and P2V2.
If there are three matching candidate commodities in the commodity library, with the following PV pairs:
P1V1, P2V2, P3V3, P6V6
P1V1, P2V2, P7V7
P1V1, P2V2, P8V8
then P1V1, P2V2, P3V3, P7V7 and P8V8 are output as attributes of the target commodity.
The confidences of P1V1, P2V2, P3V3, P7V7 and P8V8 are then calculated with the confidence formula as 100%, 100%, 33.3%, 33.3% and 33.3%, respectively.
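The candidate query and the confidence formula can be sketched together as below. Library records are modeled as plain lists of PV-pair strings, and the names `matching_candidates` and `attribute_confidence` are hypothetical. The sketch scores every pair that appears in a candidate, so with the three candidates above P6V6 would also receive 33.3%.

```python
from collections import Counter


def matching_candidates(target_pairs, commodities):
    """Keep the commodities whose PV pairs include every target pair."""
    target = set(target_pairs)
    return [pairs for pairs in commodities if target <= set(pairs)]


def attribute_confidence(candidates):
    """Confidence (in percent) of each PV pair across the candidates."""
    if not candidates:
        return {}
    # set() guards against a pair being counted twice within one record.
    counts = Counter(pair for pairs in candidates for pair in set(pairs))
    total = len(candidates)
    return {pair: 100.0 * n / total for pair, n in counts.items()}


# The example from the text, plus one record that does not match P2V2.
LIBRARY = [
    ["P1V1", "P2V2", "P3V3", "P6V6"],
    ["P1V1", "P2V2", "P7V7"],
    ["P1V1", "P2V2", "P8V8"],
    ["P1V1", "P9V9"],
]
```

When only one candidate matches, every pair it carries occurs once out of one candidate, which is why the formula reproduces the 100% shortcut described for the brand-and-model case.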
Step 206: For the target words determined to be candidate attributes, use semantic discrimination to determine the confidence that each candidate attribute is an attribute on the target platform.
First, semantic discrimination is performed based on the relations between characters. The preset attribute values on the target platform are split into individual characters in advance and used as training text, and a model is trained with the word2vec algorithm. A target word determined to be a candidate attribute is fed into the trained discrimination model to obtain character vectors; the character vectors are summed to obtain a word vector, and the cosine value of the word vector is used as the confidence sim2 that the candidate attribute is an attribute on the target platform.
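The character-vector accumulation and the cosine computation might look like the sketch below. Two things here are assumptions rather than statements from the source: the small `char_vectors` table passed in stands in for the vectors a trained word2vec model would supply, and the cosine is taken between the candidate's word vector and the vector of a preset attribute value built the same way, since the text does not name the second vector.

```python
import math


def word_vector(word, char_vectors):
    """Sum the vectors of the characters in word; unseen characters add nothing."""
    dim = len(next(iter(char_vectors.values())))
    total = [0.0] * dim
    for ch in word:
        vec = char_vectors.get(ch)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += x
    return total


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim2(candidate, preset_value, char_vectors):
    """Assumed form of the character-level confidence sim2."""
    return cosine(word_vector(candidate, char_vectors),
                  word_vector(preset_value, char_vectors))
```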
Second, semantic discrimination is performed based on the context of the target word in the unstructured text. The titles or detail pages of commodities on the target platform are used in advance as a corpus and segmented into words, and the segmentation result is used as training text to train a model with the word2vec algorithm. A target word determined to be a candidate attribute is fed into the trained discrimination model to obtain a word vector, and the cosine value of the word vector is used as the confidence sim3 that the candidate attribute is an attribute on the target platform.
Finally, the confidence S that the candidate attribute is an attribute on the target platform is determined from the similarities sim2 and sim3 obtained by the two semantic discrimination methods, for example by computing a weighted sum or weighted average of sim2 and sim3.
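A weighted average of the two similarities can be written in a couple of lines. The function name `combine_confidence` and the default weight of 0.5 are assumptions; the source does not state how the weights are chosen.

```python
def combine_confidence(sim2, sim3, weight2=0.5):
    """Weighted average S = weight2 * sim2 + (1 - weight2) * sim3."""
    if not 0.0 <= weight2 <= 1.0:
        raise ValueError("weight2 must lie in [0, 1]")
    return weight2 * sim2 + (1.0 - weight2) * sim3
```

Constraining the two weights to sum to 1 keeps S in the same range as sim2 and sim3, so it can be compared against a single confidence threshold.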
In one possible implementation, the computed confidence S can be corrected with reference to the candidate commodities from the previous step: the frequency with which each candidate attribute appears among the attributes of the candidate commodities is counted and used to correct the computed confidence, giving a corrected confidence S.
Step 207: Pool the target words determined to be attributes and candidate attributes together with the attributes of the candidate commodities, and determine the attributes of the target commodity from the pooled results according to confidence.
The confidence threshold can be set according to the accuracy required of attribute acquisition. The higher the required accuracy, the higher the confidence threshold can be set; if lower accuracy suffices, a lower confidence threshold can be set. The target words whose confidence is greater than the confidence threshold are selected from the pooled results as the attributes of the target commodity.
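The final selection reduces to filtering the pooled results by the confidence threshold. In this sketch the pooled results are assumed to be a dictionary from attribute to confidence on a common 0 to 1 scale; the name `select_attributes` is hypothetical.

```python
def select_attributes(pooled, threshold):
    """Return the attributes whose confidence exceeds threshold, highest first."""
    kept = [(attr, conf) for attr, conf in pooled.items() if conf > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [attr for attr, _ in kept]
```

Raising the threshold trades recall for precision: fewer attributes are returned, but each is better supported.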
Embodiment 3
FIG. 4 is a schematic structural diagram of an attribute acquisition device provided in Embodiment 3 of the present invention. As shown in FIG. 4, the device includes an extraction module 31 and a determining module 32.
The extraction module 31 is configured to extract, from unstructured text used to describe a target object, a target word that matches a preset attribute.
Specifically, the extraction module 31 is configured to use a similarity algorithm to perform string matching between the unstructured text and the preset attribute, obtaining the matched target words and their corresponding matching degrees.
The determining module 32 is configured to determine an attribute of the target object according to the target word.
Specifically, the determining module 32 is configured to determine the attribute of the target object from the target words according to the matching degree between each target word and the preset attribute.
Alternatively, the determining module 32 is configured to obtain the attribute of the target object by analyzing the semantics of the target word.
In this embodiment, target words that match the preset attributes of the target platform are extracted from the unstructured text used to describe the target object on the original platform, and the target object's attributes on the target platform are then determined according to the target words. A commodity's attributes can thus be extracted from unstructured text such as its title and detail description. This solves the technical problem in the prior art that unstructured text cannot be processed to obtain the attributes, on the target platform, of a commodity from the original platform.
实施例四Embodiment 4
FIG. 5 is a schematic structural diagram of an attribute acquisition device according to Embodiment 4 of the present invention. On the basis of the attribute acquisition device shown in FIG. 4, the determining module 32 further includes a first determining unit 321 and a second determining unit 322.
The first determining unit 321 is configured to determine a target word whose matching degree is higher than a first threshold as an attribute of the target object in the target platform.
The second determining unit 322 is configured to take a target word whose matching degree is higher than a second threshold but lower than the first threshold as a candidate attribute, determine by semantic discrimination whether the candidate attribute is an attribute in the target platform, and determine, according to the discrimination result, the attribute of the target object in the target platform from among the candidate attributes.
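The division of work between the two determining units can be sketched as a simple two-threshold routing step. The threshold values and the function name `route_by_matching_degree` are illustrative assumptions; the text does not specify numeric thresholds, nor how a matching degree exactly equal to the first threshold is handled (this sketch treats it as a candidate).

```python
def route_by_matching_degree(matches, first_threshold=0.9, second_threshold=0.6):
    """Split (target_word, matching_degree) pairs into accepted and candidate lists.

    Threshold values are placeholders chosen for illustration only.
    """
    accepted, candidates = [], []
    for word, score in matches:
        if score > first_threshold:
            # Role of the first determining unit: accept directly as an attribute.
            accepted.append(word)
        elif score > second_threshold:
            # Role of the second determining unit: defer to semantic discrimination.
            candidates.append(word)
        # Anything at or below the second threshold is discarded.
    return accepted, candidates
```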
Further, the second determining unit 322 may include at least one of a first discriminating subunit 3221 and a second discriminating subunit 3222. As an illustration of one possible implementation, the second determining unit 322 in FIG. 4 includes both the first discriminating subunit 3221 and the second discriminating subunit 3222.
The first discriminating subunit 3221 is configured to perform semantic discrimination based on the relationship between the characters in the candidate attribute, and to obtain a confidence that the candidate attribute is an attribute in the target platform.
Specifically, the first discriminating subunit 3221 is configured to: input each character of the candidate attribute into a pre-trained inter-character semantic discrimination model to obtain character vectors, where the inter-character semantic discrimination model is obtained by training with the characters of the attributes of the target platform as training text; accumulate the character vectors to obtain a first word vector; and use the cosine value of the first word vector as the confidence that the candidate attribute is an attribute in the target platform.
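The character-level discrimination can be sketched as follows. This is only an illustration under stated assumptions: the pre-trained inter-character model is replaced by a plain dictionary of toy character vectors, and, because the text does not say what the cosine is taken against, the sketch assumes the cosine is computed between the accumulated vector of the candidate attribute and that of a known target-platform attribute. The names `word_vector`, `cosine` and `candidate_confidence` are hypothetical.

```python
import math


def word_vector(word, char_embeddings, dim):
    """Accumulate (sum) the character vectors of a word; unknown characters add zero."""
    total = [0.0] * dim
    for ch in word:
        for i, x in enumerate(char_embeddings.get(ch, [0.0] * dim)):
            total[i] += x
    return total


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_confidence(candidate, platform_attribute, char_embeddings, dim):
    """Assumed confidence: cosine between the two accumulated character vectors."""
    return cosine(
        word_vector(candidate, char_embeddings, dim),
        word_vector(platform_attribute, char_embeddings, dim),
    )
```

In practice the character vectors would come from the trained model rather than a hand-written dictionary; the accumulation and cosine steps would be unchanged.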
The second discriminating subunit 3222 is configured to perform semantic discrimination based on the context of the candidate attribute in the unstructured text, and to obtain a confidence that the candidate attribute is an attribute in the target platform.
Specifically, the second discriminating subunit 3222 is configured to: input each word of the unstructured text into a pre-trained inter-word semantic discrimination model to obtain a second word vector, where the inter-word semantic discrimination model is obtained by training with the words of the unstructured text in the target platform as training text; and use the cosine value of the second word vector as the confidence that the candidate attribute is an attribute in the target platform.
Further, the second determining unit 322 may also include an attribute determining subunit 3223.
The attribute determining subunit 3223 is configured to determine, according to the confidence, the attribute of the target object in the target platform from among the candidate attributes.
Further, the determining module 32 also includes a matching unit 323.
The matching unit 323 is configured to: match the target words whose matching degree is higher than the first threshold against the attributes of the objects of the target platform stored in a database, to obtain the matched candidate objects; calculate, according to the frequency with which each candidate object's attribute occurs among the attributes of all candidate objects, the probability that the attribute of a candidate object is an attribute of the target object in the target platform; and determine, according to the calculated probabilities, the attribute of the target object in the target platform from among the attributes of the candidate objects.
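The matching unit's frequency-based estimate can be sketched as below. The in-memory `catalog` dictionary stands in for the database, an object counts as "matched" when it shares at least one attribute with the high-confidence target words, and `min_probability` is a hypothetical cutoff; all three are assumptions made for illustration.

```python
from collections import Counter


def infer_attributes(target_words, catalog, min_probability=0.2):
    """Estimate attribute probabilities from matched candidate objects.

    catalog maps an object id to the set of its target-platform attributes.
    """
    targets = set(target_words)
    # Candidate objects: those sharing at least one attribute with the target words.
    candidates = [set(attrs) for attrs in catalog.values() if targets & set(attrs)]
    counts = Counter(attr for attrs in candidates for attr in attrs)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Probability = frequency of an attribute among all candidate-object attributes.
    return {
        attr: n / total
        for attr, n in counts.items()
        if n / total >= min_probability
    }
```

With a catalog in which two of three objects carry the attribute `"cotton"`, matching on `"cotton"` ranks that attribute above the colours that each appear only once among the matched objects.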
Further, the attribute acquisition device provided in this embodiment also includes a category prediction module 33 and a preset attribute determining module 34.
The category prediction module 33 is configured to predict, according to the unstructured text, the category to which the target object belongs in the target platform.
The preset attribute determining module 34 is configured to use the attributes under that category in the target platform as the preset attributes.
The category prediction module 33 includes a mining unit 331 and a modeling unit 332.
The mining unit 331 is configured to perform data mining on the unstructured text of the target object using a trained classification model, to obtain the category to which the target object belongs in the target platform.
The modeling unit 332 is configured to: acquire the keywords that users searched for and the categories of the objects they selected from the search results; perform word segmentation on the keywords to obtain search terms; generate a training set from the search terms and the categories of the selected objects; and train the classification model with the training set.
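The modeling and mining units can be illustrated with a small sketch. Whitespace splitting stands in for word segmentation (Chinese text would need a real segmenter), and a term-to-category vote count stands in for the classification model, whose type the text does not specify. All function names here are hypothetical.

```python
from collections import Counter, defaultdict


def build_training_set(search_log, tokenize=str.split):
    """Turn (search keyword, category of the selected object) pairs into samples."""
    samples = []
    for keyword, category in search_log:
        terms = tokenize(keyword)  # stand-in for word segmentation
        if terms:
            samples.append((terms, category))
    return samples


def train_classifier(training_set):
    """Count how often each search term co-occurs with each category."""
    term_counts = defaultdict(Counter)
    for terms, category in training_set:
        for term in terms:
            term_counts[term][category] += 1
    return term_counts


def predict_category(model, text, tokenize=str.split):
    """Predict the category with the most term votes, or None if no term is known."""
    votes = Counter()
    for term in tokenize(text):
        votes.update(model.get(term, {}))
    return votes.most_common(1)[0][0] if votes else None
```

The predicted category would then select which target-platform attributes serve as the preset attributes for extraction.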
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and that some or all of their technical features may be replaced by equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (26)

  1. An attribute acquisition method, characterized by comprising:
    extracting, from unstructured text used to describe a target object, a target word that matches a preset attribute;
    determining an attribute of the target object according to the target word.
  2. The attribute acquisition method according to claim 1, characterized in that extracting, from the unstructured text used to describe the target object, the target word that matches the preset attribute comprises:
    performing string matching between the unstructured text and the preset attribute using a similarity algorithm, to obtain the matched target word and the corresponding matching degree.
  3. The attribute acquisition method according to claim 1, characterized in that determining the attribute of the target object according to the target word comprises:
    determining the attribute of the target object from among the target words according to the matching degree between the target word and the preset attribute.
  4. The attribute acquisition method according to claim 1, characterized in that determining the attribute of the target object according to the target word comprises:
    analyzing the semantics of the target word to obtain the attribute of the target object.
  5. The attribute acquisition method according to claim 3, characterized in that determining the attribute of the target object from among the target words according to the matching degree between the target word and the preset attribute comprises:
    determining a target word whose matching degree is higher than a first threshold as an attribute of the target object in a target platform;
    taking a target word whose matching degree is higher than a second threshold but lower than the first threshold as a candidate attribute, determining by semantic discrimination whether the candidate attribute is an attribute in the target platform, and determining, according to the discrimination result, the attribute of the target object in the target platform from among the candidate attributes.
  6. The attribute acquisition method according to claim 5, characterized in that determining by semantic discrimination whether the candidate attribute is an attribute in the target platform comprises:
    performing semantic discrimination based on the relationship between the characters in the candidate attribute, to obtain a confidence that the candidate attribute is an attribute in the target platform;
    and/or performing semantic discrimination based on the context of the candidate attribute in the unstructured text, to obtain a confidence that the candidate attribute is an attribute in the target platform.
  7. The attribute acquisition method according to claim 6, characterized in that performing semantic discrimination based on the relationship between the characters in the candidate attribute comprises:
    inputting each character of the candidate attribute into a pre-trained inter-character semantic discrimination model to obtain character vectors, wherein the inter-character semantic discrimination model is obtained by training with the characters of the attributes of the target platform as training text;
    accumulating the character vectors to obtain a first word vector;
    using the cosine value of the first word vector as the confidence that the candidate attribute is an attribute in the target platform.
  8. The attribute acquisition method according to claim 6, characterized in that performing semantic discrimination based on the context of the candidate attribute in the unstructured text comprises:
    inputting each word of the unstructured text into a pre-trained inter-word semantic discrimination model to obtain a second word vector, wherein the inter-word semantic discrimination model is obtained by training with the words of the unstructured text in the target platform as training text;
    using the cosine value of the second word vector as the confidence that the candidate attribute is an attribute in the target platform.
  9. The attribute acquisition method according to claim 6, characterized in that determining, according to the discrimination result, the attribute of the target object in the target platform from among the candidate attributes comprises:
    determining, according to the confidence, the attribute of the target object in the target platform from among the candidate attributes.
  10. The attribute acquisition method according to claim 5, characterized in that, after determining the target word whose matching degree is higher than the first threshold as an attribute of the target object in the target platform, the method further comprises:
    matching the target word whose matching degree is higher than the first threshold against the attributes of the objects of the target platform stored in a database, to obtain matched candidate objects;
    calculating, according to the frequency with which each candidate object's attribute occurs among the attributes of all candidate objects, the probability that the attribute of a candidate object is an attribute of the target object in the target platform;
    determining, according to the calculated probabilities, the attribute of the target object in the target platform from among the attributes of the candidate objects.
  11. The attribute acquisition method according to any one of claims 1 to 10, characterized in that, before extracting, from the unstructured text used to describe the target object, the target word that matches the preset attribute, the method further comprises:
    predicting, according to the unstructured text, the category to which the target object belongs in the target platform;
    using the attributes under that category in the target platform as the preset attributes.
  12. The attribute acquisition method according to claim 11, characterized in that predicting, according to the unstructured text, the category to which the target object belongs in the target platform comprises:
    performing data mining on the unstructured text of the target object using a trained classification model, to obtain the category to which the target object belongs in the target platform.
  13. The attribute acquisition method according to claim 12, characterized in that, before performing data mining using the trained classification model, the method further comprises:
    acquiring keywords that users searched for and the categories of the objects they selected from the search results;
    performing word segmentation on the keywords to obtain search terms;
    generating a training set from the search terms and the categories of the selected objects;
    training the classification model with the training set.
  14. An attribute acquisition device, characterized by comprising:
    an extraction module, configured to extract, from unstructured text used to describe a target object, a target word that matches a preset attribute;
    a determining module, configured to determine an attribute of the target object according to the target word.
  15. The attribute acquisition device according to claim 14, characterized in that
    the extraction module is specifically configured to perform string matching between the unstructured text and the preset attribute using a similarity algorithm, to obtain the matched target word and the corresponding matching degree.
  16. The attribute acquisition device according to claim 14, characterized in that
    the determining module is specifically configured to determine the attribute of the target object from among the target words according to the matching degree between the target word and the preset attribute.
  17. The attribute acquisition device according to claim 14, characterized in that
    the determining module is specifically configured to analyze the semantics of the target word to obtain the attribute of the target object.
  18. The attribute acquisition device according to claim 16, characterized in that the determining module comprises:
    a first determining unit, configured to determine a target word whose matching degree is higher than a first threshold as an attribute of the target object in a target platform;
    a second determining unit, configured to take a target word whose matching degree is higher than a second threshold but lower than the first threshold as a candidate attribute, determine by semantic discrimination whether the candidate attribute is an attribute in the target platform, and determine, according to the discrimination result, the attribute of the target object in the target platform from among the candidate attributes.
  19. The attribute acquisition device according to claim 18, characterized in that the second determining unit comprises:
    a first discriminating subunit, configured to perform semantic discrimination based on the relationship between the characters in the candidate attribute, to obtain a confidence that the candidate attribute is an attribute in the target platform;
    and/or a second discriminating subunit, configured to perform semantic discrimination based on the context of the candidate attribute in the unstructured text, to obtain a confidence that the candidate attribute is an attribute in the target platform.
  20. The attribute acquisition device according to claim 19, characterized in that
    the first discriminating subunit is specifically configured to: input each character of the candidate attribute into a pre-trained inter-character semantic discrimination model to obtain character vectors, wherein the inter-character semantic discrimination model is obtained by training with the characters of the attributes of the target platform as training text; accumulate the character vectors to obtain a first word vector; and use the cosine value of the first word vector as the confidence that the candidate attribute is an attribute in the target platform.
  21. The attribute acquisition device according to claim 19, characterized in that
    the second discriminating subunit is specifically configured to: input each word of the unstructured text into a pre-trained inter-word semantic discrimination model to obtain a second word vector, wherein the inter-word semantic discrimination model is obtained by training with the words of the unstructured text in the target platform as training text; and use the cosine value of the second word vector as the confidence that the candidate attribute is an attribute in the target platform.
  22. The attribute acquisition device according to claim 19, characterized in that the second determining unit further comprises:
    an attribute determining subunit, configured to determine, according to the confidence, the attribute of the target object in the target platform from among the candidate attributes.
  23. The attribute acquisition device according to claim 18, characterized in that the determining module further comprises:
    a matching unit, configured to: match the target word whose matching degree is higher than the first threshold against the attributes of the objects of the target platform stored in a database, to obtain matched candidate objects; calculate, according to the frequency with which each candidate object's attribute occurs among the attributes of all candidate objects, the probability that the attribute of a candidate object is an attribute of the target object in the target platform; and determine, according to the calculated probabilities, the attribute of the target object in the target platform from among the attributes of the candidate objects.
  24. The attribute acquisition device according to any one of claims 14 to 23, characterized in that the device further comprises:
    a category prediction module, configured to predict, according to the unstructured text, the category to which the target object belongs in the target platform;
    a preset attribute determining module, configured to use the attributes under that category in the target platform as the preset attributes.
  25. The attribute acquisition device according to claim 24, characterized in that the category prediction module comprises:
    a mining unit, configured to perform data mining on the unstructured text of the target object using a trained classification model, to obtain the category to which the target object belongs in the target platform.
  26. The attribute acquisition device according to claim 25, characterized in that the category prediction module further comprises:
    a modeling unit, configured to: acquire keywords that users searched for and the categories of the objects they selected from the search results; perform word segmentation on the keywords to obtain search terms; generate a training set from the search terms and the categories of the selected objects; and train the classification model with the training set.
PCT/CN2017/075829 2016-03-17 2017-03-07 Attribute acquisition method and device WO2017157198A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610154037.9 2016-03-17
CN201610154037.9A CN107203548A (en) 2016-03-17 2016-03-17 Attribute acquisition methods and device

Publications (1)

Publication Number Publication Date
WO2017157198A1 true WO2017157198A1 (en) 2017-09-21

Family

ID=59850988

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/075829 WO2017157198A1 (en) 2016-03-17 2017-03-07 Attribute acquisition method and device

Country Status (3)

Country Link
CN (1) CN107203548A (en)
TW (1) TW201734901A (en)
WO (1) WO2017157198A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263123A (en) * 2019-06-05 2019-09-20 腾讯科技(深圳)有限公司 Prediction technique, device and the computer equipment of mechanism name abbreviation
CN110807083A (en) * 2018-08-02 2020-02-18 北京京东尚科信息技术有限公司 Keyword evaluation method and device
CN110827063A (en) * 2019-10-18 2020-02-21 用友网络科技股份有限公司 Multi-strategy fused commodity recommendation method, device, terminal and storage medium
CN110874408A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Model training method, text recognition device and computing equipment
CN110955822A (en) * 2018-09-25 2020-04-03 北京京东尚科信息技术有限公司 Commodity searching method and device
CN111444334A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111444335A (en) * 2019-01-17 2020-07-24 阿里巴巴集团控股有限公司 Method and device for extracting central word
CN111860575A (en) * 2020-06-05 2020-10-30 百度在线网络技术(北京)有限公司 Method and device for processing article attribute information, electronic equipment and storage medium
CN112183035A (en) * 2020-11-06 2021-01-05 上海恒生聚源数据服务有限公司 Text labeling method, device and equipment and readable storage medium
CN112507702A (en) * 2020-12-03 2021-03-16 北京百度网讯科技有限公司 Text information extraction method and device, electronic equipment and storage medium
CN113627509A (en) * 2021-08-04 2021-11-09 口碑(上海)信息技术有限公司 Data classification method and device, computer equipment and computer readable storage medium
CN113722496A (en) * 2021-11-02 2021-11-30 北京世纪好未来教育科技有限公司 Triple extraction method and device, readable storage medium and electronic equipment
CN114201973A (en) * 2022-02-15 2022-03-18 深圳博士创新技术转移有限公司 Resource pool object data mining method and system based on artificial intelligence

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197180A (en) * 2017-12-25 2018-06-22 中山大学 A kind of method of the editable image of clothing retrieval of clothes attribute
CN110223095A (en) * 2018-03-02 2019-09-10 阿里巴巴集团控股有限公司 Determine the method, apparatus, equipment and storage medium of item property
CN109101595B (en) * 2018-07-27 2022-07-08 郑州云海信息技术有限公司 Information query method, device, equipment and computer readable storage medium
CN110807095A (en) * 2018-08-01 2020-02-18 北京京东尚科信息技术有限公司 Article matching method and device
CN109711951A (en) * 2019-01-18 2019-05-03 中合金网(北京)电子商务有限公司 Commodity automation collection and moving method
CN110175322A (en) * 2019-05-22 2019-08-27 北京神州泰岳软件股份有限公司 A kind of structural method and device of document
CN111797622B (en) * 2019-06-20 2024-04-09 北京沃东天骏信息技术有限公司 Method and device for generating attribute information
CN110334185A (en) * 2019-07-05 2019-10-15 政采云有限公司 The treating method and apparatus of data in a kind of platform
CN112800978A (en) * 2021-01-29 2021-05-14 北京金山云网络技术有限公司 Attribute recognition method, and training method and device for part attribute extraction network
CN113256379A (en) * 2021-05-24 2021-08-13 北京小米移动软件有限公司 Method for correlating shopping demands for commodities
CN113609279B (en) * 2021-08-05 2023-12-08 湖南特能博世科技有限公司 Material model extraction method and device and computer equipment
CN113724055B (en) * 2021-09-14 2024-04-09 京东科技信息技术有限公司 Commodity attribute mining method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100235390A1 (en) * 2009-03-16 2010-09-16 Fujitsu Limited Search device, search method, and computer-readable recording medium storing search program
CN102073729A (en) * 2011-01-14 2011-05-25 百度在线网络技术(北京)有限公司 Relationship knowledge sharing platform and implementation method thereof
CN102375823A (en) * 2010-08-13 2012-03-14 腾讯科技(深圳)有限公司 Searching result gathering display method and system
CN103309886A (en) * 2012-03-13 2013-09-18 阿里巴巴集团控股有限公司 Trading-platform-based structural information searching method and device
CN104504138A (en) * 2014-12-31 2015-04-08 广州索答信息科技有限公司 Human-based information fusion method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473317A (en) * 2013-09-12 2013-12-25 百度在线网络技术(北京)有限公司 Method and equipment for extracting keywords
CN103605815B (en) * 2013-12-11 2016-08-31 焦点科技股份有限公司 A kind of merchandise news being applicable to B2B E-commerce platform is classified recommendation method automatically
CN104850554B (en) * 2014-02-14 2020-05-19 北京搜狗科技发展有限公司 Searching method and system
CN105005917A (en) * 2015-07-07 2015-10-28 上海晶赞科技发展有限公司 Universal method for correlating single items of different e-commerce websites

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807083A (en) * 2018-08-02 2020-02-18 北京京东尚科信息技术有限公司 Keyword evaluation method and device
CN110874408B (en) * 2018-08-29 2023-05-26 阿里巴巴集团控股有限公司 Model training method, text recognition device and computing equipment
CN110874408A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Model training method, text recognition device and computing equipment
CN110955822B (en) * 2018-09-25 2024-02-06 北京京东尚科信息技术有限公司 Commodity searching method and device
CN110955822A (en) * 2018-09-25 2020-04-03 北京京东尚科信息技术有限公司 Commodity searching method and device
CN111444334B (en) * 2019-01-16 2023-04-25 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111444334A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111444335A (en) * 2019-01-17 2020-07-24 阿里巴巴集团控股有限公司 Method and device for extracting central word
CN111444335B (en) * 2019-01-17 2023-04-07 阿里巴巴集团控股有限公司 Method and device for extracting central word
CN110263123B (en) * 2019-06-05 2023-10-31 腾讯科技(深圳)有限公司 Method and device for predicting organization name abbreviation and computer equipment
CN110263123A (en) * 2019-06-05 2019-09-20 腾讯科技(深圳)有限公司 Prediction technique, device and the computer equipment of mechanism name abbreviation
CN110827063A (en) * 2019-10-18 2020-02-21 用友网络科技股份有限公司 Multi-strategy fused commodity recommendation method, device, terminal and storage medium
CN111860575A (en) * 2020-06-05 2020-10-30 百度在线网络技术(北京)有限公司 Method and device for processing article attribute information, electronic equipment and storage medium
CN111860575B (en) * 2020-06-05 2023-06-16 百度在线网络技术(北京)有限公司 Method and device for processing object attribute information, electronic equipment and storage medium
CN112183035A (en) * 2020-11-06 2021-01-05 上海恒生聚源数据服务有限公司 Text labeling method, device and equipment and readable storage medium
CN112183035B (en) * 2020-11-06 2023-11-21 上海恒生聚源数据服务有限公司 Text labeling method, device, equipment and readable storage medium
CN112507702B (en) * 2020-12-03 2023-08-22 北京百度网讯科技有限公司 Text information extraction method and device, electronic equipment and storage medium
CN112507702A (en) * 2020-12-03 2021-03-16 北京百度网讯科技有限公司 Text information extraction method and device, electronic equipment and storage medium
CN113627509A (en) * 2021-08-04 2021-11-09 口碑(上海)信息技术有限公司 Data classification method and device, computer equipment and computer readable storage medium
CN113627509B (en) * 2021-08-04 2024-05-10 口碑(上海)信息技术有限公司 Data classification method, device, computer equipment and computer readable storage medium
CN113722496A (en) * 2021-11-02 2021-11-30 北京世纪好未来教育科技有限公司 Triple extraction method and device, readable storage medium and electronic equipment
CN114201973B (en) * 2022-02-15 2022-06-07 深圳博士创新技术转移有限公司 Resource pool object data mining method and system based on artificial intelligence
CN114201973A (en) * 2022-02-15 2022-03-18 深圳博士创新技术转移有限公司 Resource pool object data mining method and system based on artificial intelligence

Also Published As

Publication number Publication date
TW201734901A (en) 2017-10-01
CN107203548A (en) 2017-09-26

Similar Documents

Publication Publication Date Title
WO2017157198A1 (en) Attribute acquisition method and device
More Attribute extraction from product titles in ecommerce
Putthividhya et al. Bootstrapped named entity recognition for product attribute extraction
CN105760507B (en) Cross-module state topic relativity modeling method based on deep learning
WO2020253591A1 (en) Search method and apparatus applying tag knowledge network
CN107291723B (en) Method and device for classifying webpage texts and method and device for identifying webpage texts
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN106407180B (en) Entity disambiguation method and device
WO2016180270A1 (en) Webpage classification method and apparatus, calculation device and machine readable storage medium
US20130166303A1 (en) Accessing media data using metadata repository
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
WO2014040169A1 (en) Intelligent supplemental search engine optimization
Anke et al. Syntactically aware neural architectures for definition extraction
CN108509521B (en) Image retrieval method for automatically generating text index
CN109086375A (en) A kind of short text subject extraction method based on term vector enhancing
CN109960756A (en) Media event information inductive method
CN112417863A (en) Chinese text classification method based on pre-training word vector model and random forest algorithm
WO2018090468A1 (en) Method and device for searching for video program
TW201824027A (en) Method for verifying string, method for expanding string and method for training verification model
Kozareva et al. Recognizing salient entities in shopping queries
Ramakrishna et al. A quantitative analysis of gender differences in movies using psycholinguistic normatives
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN114971730A (en) Method for extracting file material, device, equipment, medium and product thereof
CN111523311B (en) Search intention recognition method and device
US10380151B2 (en) Information processing to search for related expressions

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17765740; Country of ref document: EP; Kind code of ref document: A1)
122 EP: PCT application non-entry in European phase (Ref document number: 17765740; Country of ref document: EP; Kind code of ref document: A1)