WO2017166912A1 - Method and device for extracting core words from commodity short texts (商品短文本核心词提取方法和装置) - Google Patents

Method and device for extracting core words from commodity short texts (商品短文本核心词提取方法和装置)

Info

Publication number
WO2017166912A1
Authority
WO
WIPO (PCT)
Prior art keywords
short text
commodity
weight
word
core
Prior art date
Application number
PCT/CN2017/072157
Other languages
English (en)
French (fr)
Inventor
高维国
陈海勇
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司, 北京京东世纪贸易有限公司 filed Critical 北京京东尚科信息技术有限公司
Priority to AU2017243270A priority Critical patent/AU2017243270B2/en
Priority to US16/089,579 priority patent/US11138250B2/en
Publication of WO2017166912A1 publication Critical patent/WO2017166912A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Definitions

  • the present invention relates to the field of big data processing, and in particular, to a method and apparatus for extracting a short text core word of a commodity.
  • e-commerce websites offer a wide variety of goods and very detailed product information, and more and more customers buy goods and comment on the Internet. Therefore, e-commerce websites generate a large amount of product title data and product review data. Early methods of mining core words by manual labeling have been unable to adapt to the need to mine core words from massive commodity data.
  • at present, the bag-of-words model (BOW, Bag of Words) is generally used to automatically extract the core words of a text.
  • in the bag-of-words model, a text is treated as an unordered collection of words, ignoring grammar and even word order. This method therefore works well when extracting core words from long texts, but is less effective for short texts. Compared with long texts, short texts contain fewer words and thus have sparse features and unclear topics, making it more difficult to extract core words accurately.
  • the invention provides a core word extraction method and device for commodity short text, which is used for improving the accuracy of extracting core words from short texts of commodities.
  • a method for extracting core words from commodity short texts includes: acquiring commodity short texts in a data set; performing word segmentation on the commodity short texts; obtaining a document vector of each commodity short text according to the context information of its word segments; clustering the commodity short texts in the data set according to the document vectors; determining the cluster-level weight of each word segment in a commodity short text in the category to which that short text belongs; and determining the core word of the commodity short text according to the cluster-level weights of the word segments.
  • the method for extracting commodity short text core words further comprises: determining the commodity part-of-speech information of each word segment in the commodity short text, determining the part-of-speech weight corresponding to each word segment according to the correspondence between commodity parts of speech and part-of-speech weights, and determining the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text; and/or determining the inverse document frequency of each word segment of the commodity short text within the data set and taking it as the document-level weight of the word segment. In this case, the core word of the commodity short text is determined according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight.
  • obtaining the document vector of a commodity short text according to the context information of its word segments comprises: determining a window parameter according to the average length of the commodity short texts in the data set; and using the vector computation tool word2vec, with the data set as the input corpus and the determined window parameter as the window size, to obtain the document vectors of the commodity short texts in the data set.
  • the number of clusters is determined based on the number of commodity categories.
  • determining the cluster-level weight of each word segment in a commodity short text in the category to which the short text belongs includes: calculating, by means of the chi-square formula, the chi-square value of each word segment of the commodity short text with respect to the category to which the short text belongs, and taking the chi-square value as the cluster-level weight of the word segment in that category.
  • the following method is used to determine the correspondence between commodity parts of speech and part-of-speech weights: a training corpus is obtained, the training corpus including a number of commodity short texts used for training; the core words in each training commodity short text and the commodity part-of-speech information corresponding to the core words are labeled; and the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information. Alternatively, the correspondence between the search terms used in user searches and the clicked commodity short texts is established; the search terms are labeled as the core words of the corresponding commodity short texts, and the commodity part-of-speech information corresponding to the core words is labeled; the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • determining the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text comprises: normalizing the part-of-speech weights corresponding to the word segments in the commodity short text to obtain the local weight of each word segment.
  • determining the core word of the commodity short text according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight includes: taking the weighted sum of at least one of the local weight and the document-level weight of each word segment and its cluster-level weight as the core weight of the word segment; and determining the word segment with the largest core weight in the commodity short text as the core word of the commodity short text.
  • the following method is used to determine the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight: performing word segmentation on each commodity short text in a training data set; labeling the core words and non-core words of each commodity short text in the training data set; calculating at least one of the local weight and the document-level weight as well as the cluster-level weight of each word segment in the training data set; and, taking core words as positive samples and non-core words as negative samples, calculating the weighting coefficients by a machine learning method according to at least one of the local weight and the document-level weight and the cluster-level weight of each word segment.
  • the commodity short text core word extraction method further comprises: removing the stop words and punctuation marks in the short text of the commodity before performing the word segmentation processing on the short text of the commodity.
  • the method for extracting commodity short text core words further comprises: after performing word segmentation on the commodity short texts, counting the frequency of occurrence of all word segments in the data set and removing the word segments in the commodity short texts whose frequency is lower than a filtering threshold.
  • the short text of the item includes a product title, a product review, or a product information page content.
  • the product part-of-speech information includes one or more of a brand, a series name, a category, a noun, a property word, a style, and a modifier.
  • a commodity short text core word extraction device comprises: a preprocessing module, the preprocessing module including a short text acquisition unit for acquiring commodity short texts in a data set and a word segmentation unit for performing word segmentation on the commodity short texts; a weight determination module, including a cluster-level weight determination submodule for determining the weight of each word segment in a commodity short text, the cluster-level weight determination submodule including a document vector determination unit for obtaining the document vector of a commodity short text according to the context information of its word segments, a clustering unit for clustering the commodity short texts in the data set according to the document vectors, and a cluster-level weight determination unit for determining the cluster-level weight of each word segment in a commodity short text in the category to which the short text belongs; and a core word determination module for determining the core word of the commodity short text according to the cluster-level weights of the word segments.
  • the weight determination module further includes a local weight determination submodule and/or a document-level weight determination submodule. The local weight determination submodule comprises: a commodity part-of-speech information determination unit for determining the commodity part-of-speech information of each word segment in the commodity short text; a part-of-speech weight determination unit for determining the part-of-speech weight corresponding to each word segment in the commodity short text according to the correspondence between commodity parts of speech and part-of-speech weights; and a local weight determination unit for determining the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text. The document-level weight determination submodule is used to determine the inverse document frequency of each word segment of the commodity short text within the data set and take it as the document-level weight of the word segment. The core word determination module is used to determine the core word of the commodity short text according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight.
  • the document vector determination unit is configured to determine the window parameter according to the average length of the commodity short texts in the data set, and to use the vector computation tool word2vec, with the data set as the input corpus and the determined window parameter as the window size, to obtain the document vectors of the commodity short texts in the data set.
  • the number of clustering unit clusters is determined based on the number of commodity categories.
  • the cluster-level weight determination unit is configured to calculate, by means of the chi-square formula, the chi-square value of each word segment of the commodity short text with respect to the category to which the short text belongs, and to take the chi-square value as the cluster-level weight of the word segment in that category.
  • the local weight determination module further includes a first part-of-speech weight correspondence unit or a second part-of-speech weight correspondence unit.
  • the first part-of-speech weight correspondence unit is configured to acquire a training corpus including a plurality of short texts for training products, and label the core words in the short texts of the products for training and the product part-of-speech information corresponding to the core words, and have the same The ratio of the number of core words of the product part-of-speech information to the number of all core words in the training corpus is used as the part-of-speech weight corresponding to the part-of-speech information of the commodity.
  • the second part-of-speech weight correspondence unit is configured to establish the correspondence between the search terms used in user searches and the clicked commodity short texts, to label the search terms as the core words of the corresponding commodity short texts and label the commodity part-of-speech information corresponding to the core words, and to take the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • the local weight determination unit is configured to normalize the part-of-speech weights corresponding to the word segments in the commodity short text to obtain the local weight of each word segment.
  • the core word determination module includes: a core weight calculation unit, configured to take the weighted sum of at least one of the local weight and the document-level weight of each word segment and its cluster-level weight as the core weight of the word segment; and a core word determination unit, configured to determine the word segment with the largest core weight in the commodity short text as the core word of the commodity short text.
  • the core word determination module further includes a weighting coefficient determination unit, configured to determine the weighting coefficients of at least one of the local weight and the document-level weight of each word segment and of the cluster-level weight and send them to the core weight calculation unit. The weighting coefficient determination unit comprises: a training data word segmentation subunit for performing word segmentation on each commodity short text in a training data set; a training data labeling subunit for labeling the core words and non-core words of each commodity short text in the training data set; a word segment weight calculation subunit for calculating at least one of the local weight and the document-level weight as well as the cluster-level weight of each word segment in the training data set; and a machine learning subunit for taking core words as positive samples and non-core words as negative samples and calculating, by a machine learning method, the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight according to at least one of the local weight and the document-level weight and the cluster-level weight of each word segment.
  • the pre-processing module further includes a data cleaning unit for removing the stop words and punctuation marks in the short text of the product, and transmitting the processed short text of the product to the word segmentation unit.
  • the pre-processing module further includes a word segmentation filtering unit, configured to: after the word segmentation processing on the short text of the commodity, count the frequency of occurrence of all the word segments in the data set, and remove the word segmentation in the short text of the commodity that is lower than the filtering threshold. .
  • the short text of the item includes a product title, a product review, or a product information page content.
  • the product part-of-speech information includes one or more of a brand, a series name, a category, a noun, a property word, a style, and a modifier.
  • a commodity short text core word extracting apparatus includes: a memory; and a processor coupled to the memory, the processor configured to execute any one of the foregoing based on an instruction stored in the memory A short text core word extraction method for commodities.
  • the invention refers to the context information of the word segments in a commodity short text to obtain the document vector of that short text, which can compensate for the small amount of information in short texts and makes the clustering result based on the document vectors more accurate; the core word can then be extracted from the commodity short text more accurately according to the weight of each word segment in the cluster category to which its commodity short text belongs.
  • in addition, on the basis of the cluster-level weights, combining the local weights of the word segments makes it possible to further target the characteristics of commodity short texts and optimize the result from the perspective of commodity part-of-speech information, improving the accuracy of core word determination.
  • in addition, on the basis of the cluster-level weights, combining the document-level weights of the word segments makes it possible to further optimize the core word extraction process in terms of importance at the data set level, thereby improving the accuracy of core word determination.
  • FIG. 1 is a flow chart of an embodiment of a method for extracting a short text core word of a commodity according to the present invention.
  • FIG. 2 is a flow chart of another embodiment of a method for extracting a short text core word of a commodity according to the present invention.
  • FIG. 3A is a flow chart of still another embodiment of a method for extracting a short text core word of a commodity according to the present invention.
  • FIG. 3B is a flowchart of still another embodiment of a method for extracting a short text core word of a commodity according to the present invention.
  • FIG. 4 is a flow chart of determining weighting coefficients for respective weights according to the present invention.
  • Fig. 5 is a structural diagram showing an embodiment of a commodity short text core word extracting apparatus of the present invention.
  • Fig. 6 is a structural diagram showing another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • Fig. 7 is a structural diagram showing still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • FIG. 8 is a structural diagram of still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • Fig. 9 is a structural diagram showing still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • the present invention has been made in order to improve the accuracy of extracting core words from short texts of commodities.
  • a short text of a commodity can be regarded as a document, which often has characteristics such as sparse features and unclear themes.
  • FIG. 1 is a flow chart of an embodiment of a method for extracting a short text core word of a commodity according to the present invention. As shown in FIG. 1, the method of this embodiment includes:
  • pre-processing the short texts of the products in the data set to obtain the word segmentation of the short text of each product.
  • the preprocessing can be implemented by using steps S102 to S105:
  • Step S102 Acquire short text of the commodity in the data set.
  • the data set is a collection of short texts of the commodity to be tested, and can be obtained from a database storing short texts of the commodity.
  • the data warehouse tool hive provided by Hadoop can be used to query and obtain short text data of products, or can be obtained by other methods such as web crawling according to business requirements and system settings.
  • the product short text may be, for example, a product title, a product review, or a product information page content
  • the product information page content may be, for example, short text information such as a material or a model number of the product.
  • step S103 data cleaning is performed on the short text of the product.
  • step S103 can be selectively performed according to business needs.
  • Data cleansing includes removing stop words and punctuation marks in the short text of the item in order to improve the efficiency of subsequent data processing.
  • the stop words can be pre-set.
  • a stop word list including non-meaning words or non-target words such as auxiliary words and interjections can be set, and the data set is cleaned according to the stop word table.
  • step S104 word segmentation processing is performed on the short text of the product to obtain each word segmentation.
  • Step S105: count the frequency of occurrence of all word segments in the data set, and remove the word segments whose frequency of occurrence in the commodity short texts is lower than the filtering threshold.
  • step S105 can be selectively performed after performing word segmentation processing according to business needs. By filtering the participles whose frequency of occurrence is lower than the filtering threshold, the extraction efficiency of subsequent core words can be improved.
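  • A minimal sketch of the preprocessing steps S102 to S105 in Python is shown below; the jieba segmenter, the stop-word list, the punctuation set, and the filtering threshold are illustrative assumptions rather than part of the original disclosure.

```python
from collections import Counter
import jieba  # assumed Chinese word segmenter; any segmenter can be substituted

STOP_WORDS = {"的", "了", "和"}            # illustrative stop-word list
PUNCTUATION = set("，。！？、：；()（）")   # illustrative punctuation set
FREQ_THRESHOLD = 2                          # illustrative filtering threshold

def preprocess(short_texts):
    """Clean, segment, and frequency-filter commodity short texts (steps S103-S105)."""
    # Step S103: remove stop words and punctuation; Step S104: word segmentation
    segmented = []
    for text in short_texts:
        words = [w for w in jieba.lcut(text)
                 if w not in STOP_WORDS and w not in PUNCTUATION and w.strip()]
        segmented.append(words)
    # Step S105: count frequencies over the whole data set and drop rare word segments
    freq = Counter(w for words in segmented for w in words)
    return [[w for w in words if freq[w] >= FREQ_THRESHOLD] for words in segmented]
```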
  • the clustering level weights of the word segmentation can be determined to measure the importance of a segmentation word in a certain class.
  • the clustering level weights of the word segmentation may be determined by using steps S106-S110:
  • Step S106 obtaining a document vector of the short text of the commodity according to the word segmentation context information of the short text of the commodity.
  • One implementation is to use the vector computation tool word2vec, which is based on word segment context, to obtain the document vector of each commodity short text, thereby characterizing each commodity short text as a k-dimensional vector; the specific dimension can be set as needed.
  • the window parameter may be determined according to the average length of the commodity short texts in the data set, and the computation is performed with the data set as the input corpus and the determined window parameter as the window size, obtaining the document vectors of the commodity short texts in the data set.
  • Reasonable determination of window size parameters can improve the accuracy and efficiency of document vector calculations.
  • the size parameter indicates the dimensionality of the vector. A larger size value gives higher computation accuracy but also raises the performance requirements of the computing device; the size parameter can be adjusted as needed.
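  • A minimal sketch of step S106 based on gensim's Word2Vec is given below; taking the document vector as the average of the word vectors in a short text is an illustrative assumption, and the parameter values are placeholders.

```python
import numpy as np
from gensim.models import Word2Vec  # assumed word2vec implementation

def document_vectors(segmented_texts, dim=100):
    """Step S106: derive a k-dimensional document vector for every commodity short text."""
    # Window size chosen from the average short-text length, as described above
    avg_len = sum(len(t) for t in segmented_texts) / len(segmented_texts)
    window = max(2, int(avg_len))
    model = Word2Vec(sentences=segmented_texts, vector_size=dim,
                     window=window, min_count=1, workers=4)
    # Illustrative choice: average the word vectors of a short text as its document vector
    return np.array([
        np.mean([model.wv[w] for w in text], axis=0) if text else np.zeros(dim)
        for text in segmented_texts
    ])
```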
  • Step S108 clustering the short texts of the commodities in the data set according to the document vector.
  • the existing distance-based clustering algorithm can be used for clustering, such as K-MEANS algorithm, K-MEDOIDS algorithm, BIRCH algorithm and the like.
  • Each class can be obtained by inputting the document vector into a specific clustering algorithm for clustering.
  • the number of clusters can be determined according to the number of commodity categories. For example, the number of clusters can be set to be approximately equal to the number of commodity categories.
  • A category refers to a category name established by the e-commerce website according to the characteristics of its products, such as women's clothing, luggage, daily chemicals, digital products, and so on; the specific number, names, and degree of subdivision of the categories can be set according to business needs and the products. In this way the clustering result based on the document vectors is more accurate, and the core word can be extracted from the commodity short text more accurately according to the weight of each word segment in the cluster category to which its commodity short text belongs.
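  • A minimal sketch of step S108 using scikit-learn's K-MEANS implementation is shown below; setting the cluster count to the number of commodity categories follows the text above, and the choice of library is an assumption.

```python
from sklearn.cluster import KMeans  # any distance-based clustering algorithm can be used

def cluster_short_texts(doc_vectors, n_categories):
    """Step S108: cluster the commodity short texts by their document vectors."""
    # The number of clusters is set approximately equal to the number of commodity categories
    kmeans = KMeans(n_clusters=n_categories, random_state=0, n_init=10)
    labels = kmeans.fit_predict(doc_vectors)  # cluster label of each commodity short text
    return labels
```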
  • Step S110 determining a cluster level weight of each participle in the short text of the commodity in the category to which the short text of the commodity belongs.
  • the cluster-level weight can be determined by calculating the correlation between each word segment in a commodity short text and the category to which the short text belongs. For example, the chi-square formula may be used to calculate the chi-square value of each word segment of the commodity short text with respect to the category to which the short text belongs, and the chi-square value may be taken as the cluster-level weight of the word segment in that category.
  • For example, suppose a commodity short text belongs to the "digital products" category; the chi-square value of its word segment "mobile phone" can be calculated with reference to Table 1.
  • Table 1 is the independent-sample four-cell table of the chi-square computation. As shown in Table 1, let A be the number of short texts in the "digital products" class that contain "mobile phone", B the number of short texts in classes other than "digital products" that contain "mobile phone", C the number of short texts in the "digital products" class that do not contain "mobile phone", D the number of short texts in classes other than "digital products" that do not contain "mobile phone", and N the total number of commodity short texts participating in the clustering.
  • the chi-square value K² is then the cluster-level weight of "mobile phone" in that short text.
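  • A minimal sketch of this cluster-level weight computation is shown below; the 2x2 chi-square statistic written out here is the standard formula for a four-cell table and is assumed to correspond to the chi-square formula referred to above, with illustrative counts in the example call.

```python
def chi_square_weight(A, B, C, D):
    """Cluster-level weight of a word segment: chi-square value of the 2x2 table (A, B, C, D)."""
    N = A + B + C + D  # total number of commodity short texts in the clustering
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

# Example: "mobile phone" vs. the "digital products" cluster (illustrative counts)
weight = chi_square_weight(A=120, B=30, C=400, D=2000)
```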
  • Finally, the core word is determined using step S112:
  • Step S112 determining a core word of the short text of the commodity according to the cluster level weight of each word segment.
  • the word segment with the largest value of the cluster level weight may be used as the core word.
  • Other methods can also be used to determine the core words based on the values of the cluster level weights as needed.
  • In this embodiment, the small amount of information in short texts can be compensated for, the clustering result based on the document vectors is more accurate, and the core word can then be extracted from the commodity short text more accurately according to the weight of each word segment in the cluster category to which its commodity short text belongs.
  • In addition to referring to the cluster-level weight of each word segment, the result can be further optimized in combination with other weights. For example, the local weight can be combined to measure the importance of the commodity part-of-speech information of a word segment within a single commodity short text, and/or the document-level weight can be combined to measure the importance of a word segment within the entire data set.
  • FIG. 2 is a flow chart of another embodiment of a method for extracting a short text core word of a commodity according to the present invention. As shown in FIG. 2, the method of this embodiment includes:
  • pre-processing the short texts of the products in the data set to obtain the word segmentation of the short text of each product.
  • the implementation of the pre-processing can be referred to, for example, the description of steps S102-S105 in the foregoing.
  • the local weight of the word segmentation may also be determined to measure the importance of the word segmentation in the short text of the single article. For example, the local weights may be determined using steps S206-S210:
  • Step S206 determining the product part-of-speech information of each participle in the short text of the commodity.
  • the commodity part-of-speech information refers to part-of-speech information introduced to reflect the characteristics of commodities, such as brand, series name, category, attribute word, style, modifier, title, author, and the like.
  • the same participle in the short text of the same commodity can have a variety of commodity part of speech information.
  • For example, a commodity part-of-speech information dictionary may be preset, the dictionary including a number of words and the corresponding commodity part-of-speech information. For each word segment, the commodity part-of-speech information corresponding to it can be looked up in the dictionary one by one and labeled.
  • Step S208 determining the part of speech weight corresponding to the word segmentation in the short text of the commodity according to the correspondence between the product part of speech and the part of speech weight.
  • Part-of-speech weight refers to the importance of a certain category of commodity part-of-speech information in the short text of the commodity. For example, in the product title, brands and categories are relatively important words, while attribute words such as "500ml” and "2 meters” are words of lower importance. Therefore, the part-of-speech weight corresponding to the product part-of-speech information can be set according to the importance of the product part-of-speech information and the specific business requirement. The above process can be set manually or by a statistically based method.
  • Step S210 Determine local weights of each word segment according to the part of speech weight corresponding to each participle in the short text of the product.
  • When a word segment has more than one kind of commodity part-of-speech information, the part-of-speech weights corresponding to the multiple kinds of commodity part-of-speech information may be accumulated.
  • The part-of-speech weight corresponding to a word segment can be used directly as its local weight, or the part-of-speech weights of all word segments in the commodity short text can be normalized and the normalized result used as the local weight of each word segment. Let x be the local weight of a word segment before normalization, y its local weight after normalization, and min and max the minimum and maximum local weights in the commodity short text to which the word segment belongs; then y can be obtained by the normalization of equation (2):
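  • The normalization of equation (2) is assumed here to be the standard min-max form consistent with the definitions of x, y, min, and max above: y = (x - min) / (max - min).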
  • step S212 is used to determine the core word:
  • Step S212 determining a core word of the short text of the commodity according to the cluster level weight and the local weight of each word segment.
  • the invention also provides a method for establishing a correspondence between a part-of-speech information and a part-of-speech weight based on a statistical method.
  • In one exemplary method, a training corpus is first acquired, the training corpus including a number of commodity short texts used for training. The core words of the training commodity short texts and the commodity part-of-speech information corresponding to the core words are then labeled; for example, the core words and their commodity parts of speech can be labeled by means of an offline dictionary, using the core words recorded in the dictionary and their corresponding commodity parts of speech to label the core words in the training corpus. Finally, the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • Another exemplary method can label core words automatically based on users' search behavior. First, the correspondence between the search terms used in user searches and the clicked commodity short texts is established; then, each search term is labeled as the core word of the corresponding commodity short text, and the commodity part-of-speech information corresponding to the core word is labeled; finally, the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • For example, the correspondence between a search term and the commodity short text clicked by the user can be established and labeled by means of the search click log.
  • For example, according to the search click log, a user searches for a product with the search term "mobile phone", the search results are "Apple iPhone 6s 64G Deep Space Gray Telecom 4G Mobile Phone" and "Xiaomi Note White Mobile 4G Mobile Phone", and the user clicks on the former, i.e., "Apple iPhone 6s 64G Deep Space Gray Telecom 4G Mobile Phone"; "mobile phone" is then labeled as a core word of "Apple iPhone 6s 64G Deep Space Gray Telecom 4G Mobile Phone".
  • the above-mentioned statistical-based method is used to determine the part-of-speech weight corresponding to the part-of-speech information of the product, so that the value of the part-of-speech weight is more applicable to the current use environment, thereby improving the accuracy of the core word extraction.
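  • A minimal sketch of this statistics-based determination of part-of-speech weights is shown below; the input format (pairs of core word and its commodity part of speech) is an assumed representation of the labeled training corpus.

```python
from collections import Counter

def part_of_speech_weights(labeled_core_words):
    """Compute part-of-speech weights from labeled (core_word, commodity_pos) pairs."""
    pos_counts = Counter(pos for _, pos in labeled_core_words)
    total = len(labeled_core_words)  # number of all core words in the training corpus
    # Weight of a commodity part of speech = share of core words carrying that part of speech
    return {pos: count / total for pos, count in pos_counts.items()}

# Illustrative labeled corpus, e.g. search terms clicked by users and labeled as core words
weights = part_of_speech_weights([("手机", "category"), ("苹果", "brand"), ("iPhone", "series name")])
```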
  • FIG. 3A is a flow chart of still another embodiment of a method for extracting a short text core word of a commodity according to the present invention. As shown in FIG. 3A, the method of this embodiment includes:
  • pre-processing the short texts of the products in the data set to obtain the word segmentation of the short text of each product.
  • the implementation of the pre-processing can be referred to, for example, the description of steps S102-S105 in the foregoing.
  • the document level weight of the word segmentation may also be determined to measure the importance of the word segmentation in all the short texts of the product in the data set. For example, step S306 may be used to determine document level weights:
  • Step S306: determine the inverse document frequency of each word segment of the commodity short text within the data set, and take it as the document-level weight of the word segment.
  • The inverse document frequency reflects how commonly a word segment occurs across the documents of the corpus: the more documents contain it, the smaller the corresponding idf value. That is, if a word segment appears in a large number of different documents, it cannot represent the characteristics of any single document. Since each commodity short text is treated as one document, the inverse document frequency idf of each word segment of a commodity short text within the data set can be calculated by formula (3) (in the standard form, idf = log(M / L)), where M is the total number of commodity short texts in the data set, L is the number of commodity short texts containing the word segment, and idf is the document-level weight of the word segment in the commodity short text.
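  • A minimal sketch of the document-level weight computation of step S306 is shown below; the logarithm base and the use of raw counts follow the standard idf definition and are assumptions.

```python
import math

def document_level_weights(segmented_texts):
    """Step S306: idf-based document-level weight of every word segment in the data set."""
    M = len(segmented_texts)   # total number of commodity short texts
    doc_freq = {}              # number of short texts containing each word segment
    for words in segmented_texts:
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(M / L) for w, L in doc_freq.items()}
```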
  • Finally, the core word is determined using step S312a:
  • Step S312a determining a core word of the short text of the commodity according to the cluster level weight of each word segment and the document level weight.
  • the core word extraction process can be further optimized from the importance of the data set level, thereby improving the accuracy of the core word determination.
  • In addition, the cluster-level weight, the local weight, and the document-level weight can be obtained separately by combining the respective embodiments above and then used together as the basis for determining the core word; that is, step S312b is performed, and the core word of the commodity short text is determined according to the cluster-level weight, the local weight, and the document-level weight of each word segment.
  • the above-mentioned commodity short text core word extraction method can be referred to FIG. 3B.
  • When the core word is determined according to more than one weight, the weighted sum of at least one of the local weight and the document-level weight of each word segment and its cluster-level weight can first be taken as the core weight of the word segment, and the word segment with the largest core weight in the commodity short text is then determined as the core word of the commodity short text.
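  • A minimal sketch of this weighted combination is given below; the coefficient defaults of 1.0 and the dictionary-based weight lookup are illustrative assumptions, with the coefficients corresponding to the weighting coefficients discussed next.

```python
def pick_core_word(segments, local_w, doc_w, cluster_w, coeffs=(1.0, 1.0, 1.0)):
    """Core weight = weighted sum of local, document-level, and cluster-level weights."""
    w_local, w_doc, w_cluster = coeffs  # weighting coefficients (all 1.0 by default)
    core_weight = {
        seg: w_local * local_w.get(seg, 0.0)
             + w_doc * doc_w.get(seg, 0.0)
             + w_cluster * cluster_w.get(seg, 0.0)
        for seg in segments
    }
    # The word segment with the largest core weight is the core word of the short text
    return max(core_weight, key=core_weight.get)
```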
  • FIG. 4 is a flow chart of a method for determining weighting coefficients for respective weights according to the present invention. As shown in FIG. 4, the method of this embodiment includes:
  • Step S402 performing word segmentation processing on each short text of the commodity in the training data set.
  • Step S404 labeling the core words and non-core words of each commodity short text in the training data set.
  • the annotation of the core word and the non-core word may be manually marked; or the search word used in the user search may be used as the core word corresponding to the search result clicked by the user according to the search click log.
  • Step S406 calculating at least one of a local weight and a document level weight of each participle in the training data set and a cluster level weight.
  • the specific calculation method adopts the method for calculating the local weight, the document level weight, and the cluster level weight of each word segment in the foregoing various embodiments.
  • Step S408: take the core words as positive samples and the non-core words as negative samples and, according to at least one of the local weight and the document-level weight and the cluster-level weight of each word segment, calculate the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight by a machine learning method, such as linear regression, a decision tree, or a neural network.
  • the direct summation of different weights can be directly used as the core weight, that is, the weighting coefficients of each weight are all 1, which is convenient for calculation.
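  • A minimal sketch of step S408 using logistic regression from scikit-learn is shown below; logistic regression is one illustrative choice among the machine learning methods mentioned above, and the sample layout is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # illustrative choice of learner

def learn_weighting_coefficients(samples):
    """samples: list of (local_weight, doc_level_weight, cluster_level_weight, is_core_word)."""
    X = np.array([s[:3] for s in samples])             # per-segment weights as features
    y = np.array([1 if s[3] else 0 for s in samples])  # core word = positive sample
    clf = LogisticRegression().fit(X, y)
    # The learned feature coefficients serve as the weighting coefficients of the three weights
    return clf.coef_[0]
```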
  • Fig. 5 is a structural diagram showing an embodiment of a commodity short text core word extracting apparatus of the present invention.
  • the apparatus of this embodiment includes a pre-processing module 52, a weight determination module 54, and a core word determination module 56.
  • the pre-processing module 52 includes a short text obtaining unit 522 for acquiring short text of the commodity in the data set, and a word segmentation unit 524 for performing word segmentation processing on the short text of the commodity.
  • the weight determination module 54 includes a cluster level weight determination sub-module 542 for determining the weight of each word segment in the short text of the product.
  • the cluster level weight determining sub-module 542 includes: a document vector determining unit 5422, configured to obtain a document vector of the commodity short text according to the word segment context information of the commodity short text; and a clustering unit 5424, configured to compare the products in the data set according to the document vector
  • the short text is clustered;
  • the cluster-level weight determination unit 5426 is configured to determine the cluster-level weight of each word segment in the commodity short text in the category to which the commodity short text belongs.
  • the core word determination module 56 is configured to determine the core word of the short text of the commodity according to the cluster level weight of each word segmentation.
  • In this way, the small amount of information in short texts can be compensated for, the clustering result of the clustering unit based on the document vectors is more accurate, and the core word can be extracted from the commodity short text more accurately according to the weight, determined by the cluster-level weight determination unit, of each word segment in the cluster category to which its commodity short text belongs.
  • the short text of the product may include a product title or a product review.
  • the document vector determination unit 5422 can be configured to determine the window parameter according to the average length of the commodity short texts in the data set, and to use the vector computation tool word2vec, with the data set as the input corpus and the determined window parameter as the window size, to obtain the document vectors of the commodity short texts in the data set.
  • the number of clusters of the clustering unit 5424 can be determined according to the number of commodity categories, so that the clustering result based on the document vectors is more accurate, and the core word can be extracted from the commodity short text more accurately according to the weight of each word segment in the cluster category to which its commodity short text belongs.
  • the cluster-level weight determination unit 5426 can be configured to calculate, by means of the chi-square formula, the chi-square value of each word segment of the commodity short text with respect to the category to which the commodity short text belongs, and to take the chi-square value as the cluster-level weight of the word segment in that category.
  • Fig. 6 is a structural diagram showing another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • the weight determination module 54 of this embodiment further includes a local weight determination sub-module 644 and/or a document level weight determination sub-module 646.
  • the local weight determination sub-module 644 includes: a commodity part-of-speech information determination unit 6442, configured to determine the commodity part-of-speech information of each word segment in the commodity short text; a part-of-speech weight determination unit 6444, configured to determine the part-of-speech weight corresponding to each word segment in the commodity short text according to the correspondence between commodity parts of speech and part-of-speech weights; and a local weight determination unit 6446, configured to determine the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text.
  • On the basis of the cluster-level weight determination sub-module, combining the local weight determination sub-module makes it possible to further target the characteristics of commodity short texts and optimize the result from the perspective of commodity part-of-speech information, improving the accuracy of core word determination.
  • the document level weight determination sub-module 646 is configured to determine the inverse document frequency of each word segment of the commodity short text within the data set and take it as the document-level weight of the word segment; the core word determination module 56 is then configured to determine the core word of the commodity short text according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight.
  • the core word extraction process can be further optimized from the importance of the data set level, thereby improving the accuracy of the core word determination.
  • the product part-of-speech information may include one or more of a brand, a series name, a category, a noun, a property word, a style, and a modifier.
  • the local weight determining unit 6446 can be used to normalize the part-of-speech weights corresponding to each participle in the short text of the product, and obtain the local weight of each participle. Through the normalized processing, the statistical distribution characteristics of the local weights of each participle can be visually reflected.
  • the core word determination module 56 may include: a core weight calculation unit 662, configured to take the weighted sum of at least one of the local weight and the document-level weight of each word segment and its cluster-level weight as the core weight of the word segment; and a core word determination unit 664, configured to determine the word segment with the largest core weight in the commodity short text as the core word of the commodity short text.
  • Fig. 7 is a structural diagram showing still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • the local weight determination sub-module 644 may further include a first part-of-speech weight correspondence unit 7442 or a second part-of-speech weight correspondence unit 7444.
  • the first part-of-speech weight correspondence unit 7442 is configured to acquire a training corpus including a number of commodity short texts used for training, to label the core words of the training commodity short texts and the commodity part-of-speech information corresponding to the core words, and to take the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • the second part-of-speech weight correspondence unit 7444 is configured to establish the correspondence between the search terms used in user searches and the clicked commodity short texts, to label the search terms as the core words of the corresponding commodity short texts and label the commodity part-of-speech information corresponding to the core words, and then to take the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus as the part-of-speech weight corresponding to that commodity part-of-speech information.
  • For example, the second part-of-speech weight correspondence unit 7444 can establish the correspondence between a search term and the commodity short text clicked by the user by means of the search click log.
  • In this way, the values of the part-of-speech weights are better suited to the current use environment, thereby improving the accuracy of core word extraction.
  • the core word determination module 56 may further include a weighting coefficient determination unit 762, configured to determine the weighting coefficients of at least one of the local weight and the document-level weight of each word segment and of the cluster-level weight and send them to the core weight calculation unit 662.
  • the weighting coefficient determination unit 762 includes: a training data word segmentation subunit 7662, configured to perform word segmentation on each commodity short text in a training data set; a training data labeling subunit 7624, configured to label the core words and non-core words of each commodity short text in the training data set; a word segment weight calculation subunit 7626, configured to calculate at least one of the local weight and the document-level weight as well as the cluster-level weight of each word segment in the training data set; and a machine learning subunit 7628, configured to take the core words as positive samples and the non-core words as negative samples and calculate, by a machine learning method, the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight according to at least one of the local weight and the document-level weight and the cluster-level weight of each word segment.
  • With the weighting coefficient determination unit, when the core word of a commodity short text is determined according to a plurality of weights, the proportions between the respective weights can be adjusted, improving the accuracy of core word determination.
  • the pre-processing module 52 may further include a data cleaning unit 722, configured to remove the stop words and punctuation marks in the commodity short texts and send the processed commodity short texts to the word segmentation unit 524.
  • the pre-processing module 52 may further include a word segmentation filtering unit 724, configured to count, after word segmentation of the commodity short texts, the frequency of occurrence of all word segments in the data set and remove the word segments in the commodity short texts whose frequency is lower than the filtering threshold.
  • FIG. 8 is a structural diagram of still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • the apparatus 800 of this embodiment includes a memory 810 and a processor 820 coupled to the memory 810, the processor 820 being configured to execute, based on instructions stored in the memory 810, the commodity short text core word extraction method of any one of the foregoing embodiments.
  • the memory 810 may include, for example, a system memory, a fixed non-volatile storage medium, or the like.
  • the system memory stores, for example, an operating system, an application, a boot loader, and other programs.
  • Fig. 9 is a structural diagram showing still another embodiment of the commodity short text core word extracting apparatus of the present invention.
  • the apparatus 800 of this embodiment includes a memory 810 and a processor 820, and may further include an input/output interface 930, a network interface 940, a storage interface 950, and the like. These interfaces 930, 940, 950 and the memory 810 and the processor 820 can be connected, for example, via a bus 960.
  • the input/output interface 930 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
  • Network interface 940 provides a connection interface for various networked devices.
  • the storage interface 950 provides a connection interface for an external storage device such as an SD card or a USB flash drive.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code. .
  • the computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method and device for extracting core words from commodity short texts, and relates to the field of big data processing. The method for extracting core words from commodity short texts includes: acquiring commodity short texts in a data set; performing word segmentation on the commodity short texts; obtaining a document vector of each commodity short text according to the context information of its word segments; clustering the commodity short texts in the data set according to the document vectors; determining the cluster-level weight of each word segment in a commodity short text in the category to which that short text belongs; and determining the core word of the commodity short text according to the cluster-level weights of the word segments. The present invention refers to the context information of the word segments in a commodity short text to obtain the document vector of that short text, which can compensate for the small amount of information in short texts and makes the clustering result based on the document vectors more accurate; the core word can then be extracted from the commodity short text more accurately according to the weight of each word segment in the cluster category to which its commodity short text belongs.

Description

Method and device for extracting core words from commodity short texts
Technical Field
The present invention relates to the field of big data processing, and in particular to a method and device for extracting core words from commodity short texts.
Background Art
With the rapid development of e-commerce websites, e-commerce websites offer a wide variety of goods and very detailed product information, and more and more customers buy and review goods online. As a result, e-commerce websites generate a large amount of product title data and product review data. Early methods of mining core words through manual labeling can no longer meet the need to mine core words from massive commodity data.
At present, the bag-of-words model (BOW, Bag of Words) is generally used to automatically extract the core words of a text. In the bag-of-words model, a text is treated as an unordered collection of words, ignoring grammar and even word order. This method therefore works well when extracting core words from long texts, but is less effective for short texts. Compared with long texts, short texts contain fewer words and thus have sparse features and unclear topics, making it more difficult to extract core words accurately.
Summary of the Invention
The present invention provides a method and device for extracting core words from commodity short texts, which are used to improve the accuracy of extracting core words from commodity short texts.
According to a first aspect of the present invention, a method for extracting core words from commodity short texts is provided, including: acquiring commodity short texts in a data set; performing word segmentation on the commodity short texts; obtaining a document vector of each commodity short text according to the context information of its word segments; clustering the commodity short texts in the data set according to the document vectors; determining the cluster-level weight of each word segment in a commodity short text in the category to which that short text belongs; and determining the core word of the commodity short text according to the cluster-level weights of the word segments.
In one embodiment, the method further includes: determining the commodity part-of-speech information of each word segment in the commodity short text, determining the part-of-speech weight corresponding to each word segment according to the correspondence between commodity parts of speech and part-of-speech weights, and determining the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text; and/or determining the inverse document frequency of each word segment of the commodity short text within the data set and taking it as the document-level weight of the word segment. Determining the core word of the commodity short text according to the cluster-level weights of the word segments then includes: determining the core word of the commodity short text according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight.
In one embodiment, obtaining the document vector of a commodity short text according to the context information of its word segments includes: determining a window parameter according to the average length of the commodity short texts in the data set; and using the vector computation tool word2vec, with the data set as the input corpus and the determined window parameter as the window size, to obtain the document vectors of the commodity short texts in the data set.
In one embodiment, the number of clusters is determined according to the number of commodity categories.
In one embodiment, determining the cluster-level weight of each word segment in a commodity short text in the category to which the short text belongs includes: calculating, by means of the chi-square formula, the chi-square value of each word segment of the commodity short text with respect to the category to which the short text belongs, and taking the chi-square value as the cluster-level weight of the word segment in that category.
In one embodiment, the correspondence between commodity parts of speech and part-of-speech weights is determined as follows: a training corpus is obtained, the training corpus including a number of commodity short texts used for training; the core words in each training commodity short text and the commodity part-of-speech information corresponding to the core words are labeled; and the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information. Alternatively, the correspondence between the search terms used in user searches and the clicked commodity short texts is established; the search terms are labeled as the core words of the corresponding commodity short texts, and the commodity part-of-speech information corresponding to the core words is labeled; the ratio of the number of core words having the same commodity part-of-speech information to the number of all core words in the training corpus is taken as the part-of-speech weight corresponding to that commodity part-of-speech information.
In one embodiment, determining the local weight of each word segment according to the part-of-speech weights corresponding to the word segments in the commodity short text includes: normalizing the part-of-speech weights corresponding to the word segments in the commodity short text to obtain the local weight of each word segment.
In one embodiment, determining the core word of the commodity short text according to at least one of the local weight and the document-level weight of each word segment together with its cluster-level weight includes: taking the weighted sum of at least one of the local weight and the document-level weight of each word segment and its cluster-level weight as the core weight of the word segment; and determining the word segment with the largest core weight in the commodity short text as the core word of the commodity short text.
In one embodiment, the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight are determined as follows: performing word segmentation on each commodity short text in a training data set; labeling the core words and non-core words of each commodity short text in the training data set; calculating at least one of the local weight and the document-level weight as well as the cluster-level weight of each word segment in the training data set; and, taking core words as positive samples and non-core words as negative samples, calculating the weighting coefficients of at least one of the local weight and the document-level weight and of the cluster-level weight by a machine learning method according to at least one of the local weight and the document-level weight and the cluster-level weight of each word segment.
In one embodiment, the method further includes: before performing word segmentation on the commodity short texts, removing the stop words and punctuation marks in the commodity short texts.
In one embodiment, the method further includes: after performing word segmentation on the commodity short texts, counting the frequency of occurrence of all word segments in the data set and removing the word segments in the commodity short texts whose frequency is lower than a filtering threshold.
In one embodiment, the commodity short text includes a product title, a product review, or product information page content.
In one embodiment, the commodity part-of-speech information includes one or more of brand, series name, category, noun, attribute word, style, and modifier.
根据本发明的第二个方面,提供一种商品短文本核心词提取装置,包括:预处理模块,预处理模块包括:短文本获取单元,用于获取数据集内的商品短文本;分词单元,用于对商品短文本进行分词处理;权重确定模块,包括聚类层级权重确定子模块,用于确定商品短文本中各个分词的权重;聚类层级权重确定子模块包括:文档向量确定单元,用于根据商品短文本的分词上下文信息获得商品短文本的文档向量;聚类单元,用于根据文档向量对数据集内的商品短文本进行聚类;聚类层级权重确定单元,用于确定商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重;核心词确定模块,用于根据各个分词的聚类层级权重确定商品短文本的核心词。
在一个实施例中,权重确定模块还包括局部权重确定子模块和/或文档层级权重确定子模块;其中,局部权重确定子模块包括:商品词性信息确定单元,用于确定商品短文本中的各个分词的商品词性信息;词性权重确定单元,用于根据商品词性和词性权重的对应关系确定商品短文本中的分词对应的词性权重;局部权重确定单元,用于根据商品短文本中的各个分词对应的词性权重确定各个分词的局部权重;其中,文档层级权重确定子模块用于确定商品短文本中的各个分词在数据集内的逆向文件频率,将其确定为各个分词的文档层级权重;核心词确定模块用于根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重确定商品短文本的核心词。
在一个实施例中，文档向量确定单元用于根据数据集内商品短文本的平均长度确定窗口参数，采用向量运算工具word2vec，将数据集作为输入语料、确定的窗口参数作为窗口大小进行计算，得到数据集内的商品短文本的文档向量。
在一个实施例中,聚类单元聚类的数量根据商品品类的数量确定。
在一个实施例中,聚类层级权重确定单元用于采用卡方公式计算商品短文本中的各个分词在商品短文本所属的类别的卡方值,将卡方值作为商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重。
在一个实施例中，局部权重确定子模块还包括第一词性权重对应单元或者第二词性权重对应单元。其中，第一词性权重对应单元用于获取包括若干用于训练的商品短文本的训练语料，并标注各个用于训练的商品短文本中的核心词和核心词对应的商品词性信息，将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。其中，第二词性权重对应单元用于建立用户搜索时采用的搜索词和点击的商品短文本的对应关系，将搜索词标注为对应的商品短文本的核心词，并标注核心词对应的商品词性信息，再将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。
在一个实施例中,局部权重确定单元用于对商品短文本中的各个分词对应的词性权重进行归一化处理,获得各个分词的局部权重。
在一个实施例中,核心词确定模块包括:核心权重计算单元,用于将各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重加权求和作为各个分词的核心权重;核心词确定单元,用于将商品短文本中核心权重的值最大的分词确定为商品短文本的核心词。
在一个实施例中,核心词确定模块还包括加权系数确定单元,用于确定各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数并发送给核心权重计算单元,加权系数确定单元包括:训练数据分词子单元,用于对训练数据集内的各个商品短文本进行分词处理;训练数据标注子单元,用于标注训练数据集内各个商品短文本的核心词和非核心词;分词权重计算子单元,用于计算训练数据集内各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重;机器学习子单元,用于将核心词作为正样本,非核心词作为负样本,根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重,采用机器学习的方法计算局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数。
在一个实施例中,预处理模块还包括数据清理单元,用于去掉商品短文本中的停用词和标点符号,并将处理后的商品短文本发送给分词单元。
在一个实施例中,预处理模块还包括分词过滤单元,用于在对商品短文本进行分词处理之后,统计数据集内所有分词的出现频率,去掉商品短文本中出现频率低于过滤阈值的分词。
在一个实施例中,商品短文本包括商品标题、商品评论或商品信息页内容。
在一个实施例中,商品词性信息包括品牌、系列名称、品类、名词、属性词、款式、修饰词中的一个或多个。
根据本发明的第三个方面,提供一种商品短文本核心词提取装置,包括:存储器;以及耦接至存储器的处理器,处理器被配置为基于存储在存储器中的指令,执行前述任意一种商品短文本核心词提取方法。
本发明参考了商品短文本中分词的上下文信息来获得该商品短文本的文档向量,可以弥补短文本信息量少的缺陷,使基于文档向量的聚类结果更准确,进而依据分词在其商品短文本所属聚类类别的权重可以更加准确地从商品短文本中提取出核心词。
此外,在聚类层级权重的基础上,通过结合分词的局部权重,可以进一步针对商品短文本的特点,从其商品词性信息的角度对结果进行优化,提高了核心词确定的准确性。
此外,在聚类层级权重的基础上,通过结合分词的文档层级权重,能够进一步从数据集层面的重要性对核心词提取过程进行优化,从而提升了核心词确定的准确性。
通过以下参照附图对本发明的示例性实施例的详细描述,本发明的其它特征及其优点将会变得清楚。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本发明商品短文本核心词提取方法的一个实施例的流程图。
图2为本发明商品短文本核心词提取方法的另一个实施例的流程图。
图3A为本发明商品短文本核心词提取方法又一个实施例的流程图。
图3B为本发明商品短文本核心词提取方法再一个实施例的流程图。
图4为本发明确定各个权重的加权系数的流程图。
图5为本发明商品短文本核心词提取装置的一个实施例的结构图。
图6为本发明商品短文本核心词提取装置的另一个实施例的结构图。
图7为本发明商品短文本核心词提取装置的又一个实施例的结构图。
图8为本发明商品短文本核心词提取装置的再一个实施例的结构图。
图9为本发明商品短文本核心词提取装置的再一个实施例的结构图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本发明及其应用或使用的任何限制。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
为了改善从商品短文本中提取核心词的准确度,提出本发明。在本发明中,一条商品短文本可以视为一个文档,其往往具有特征稀疏、主题不明确等特点。
下面参考图1描述本发明一个实施例的商品短文本核心词提取方法。
图1为本发明商品短文本核心词提取方法的一个实施例的流程图。如图1所示,该实施例的方法包括:
首先,对数据集内的商品短文本进行预处理,得到各个商品短文本的分词。例如,可以采用步骤S102~S105实现预处理:
步骤S102,获取数据集内的商品短文本。
其中,数据集为待测商品短文本的集合,可以从存储商品短文本的数据库中获取。例如,可以采用Hadoop提供的数据仓库工具hive进行商品短文本数据的查询和获取,也可以根据业务需求和系统设置,采用例如网页爬虫等其他方式获取。商品短文本例如可以为商品标题、商品评论或商品信息页内容等,商品信息页内容例如可以为有关商品的材质、型号等短文本信息。
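作为补充说明，下面给出一个示意性的Python代码草图，演示通过hive获取商品标题短文本的一种可能方式；其中的PyHive库、主机地址、库表名和字段名均为假设，并非本发明限定的实现。

```python
# 示意性示例：通过Hive数据仓库读取商品标题短文本（步骤S102的一种可能实现）
# 说明：PyHive库、主机地址、库名、表名、字段名均为假设，需按实际业务环境替换
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000,
                       username="analyst", database="mall")
cursor = conn.cursor()
# 假设商品标题存放在 product_title 表的 title 字段中
cursor.execute("SELECT title FROM product_title LIMIT 100000")
short_texts = [row[0] for row in cursor.fetchall()]
cursor.close()
conn.close()
```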
步骤S103,对商品短文本进行数据清理。
其中,可以根据业务需要选择性地进行步骤S103。数据清理例如包括去掉商品短文本中的停用词和标点符号,以便提升后续数据处理的效率。其中,停用词可以进行预先设定,例如可以设置包含助词、感叹词等无意义词汇或非目标词汇的停用词表,根据停用词表清理数据集。
步骤S104,对商品短文本进行分词处理得到各个分词。
步骤S105,统计数据集内所有分词的出现频率,去掉商品短文本中出现频率低于过滤阈值的分词。
其中,可以根据业务需要,在进行分词处理之后选择性地进行步骤S105。通过过滤出现频率低于过滤阈值的分词,可以提升后续核心词的提取效率。
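下面给出预处理步骤（数据清理、分词、低频分词过滤）的一个示意性Python草图；其中jieba分词器、停用词表和过滤阈值均为假设，可替换为实际使用的分词工具与参数。

```python
# 示意性示例：数据清理、分词与低频分词过滤（步骤S103~S105的一种可能实现）
import re
from collections import Counter

import jieba

STOPWORDS = {"的", "了", "与", "和"}   # 假设的停用词表
FILTER_THRESHOLD = 2                    # 假设的过滤阈值

def clean(text):
    """去掉标点符号，仅保留中英文、数字等词汇字符。"""
    return re.sub(r"[^\w]+", " ", text)

def preprocess(short_texts):
    # 分词并去掉停用词
    docs = [[w for w in jieba.lcut(clean(t)) if w.strip() and w not in STOPWORDS]
            for t in short_texts]
    # 统计数据集内所有分词的出现频率，去掉低于过滤阈值的分词
    freq = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if freq[w] >= FILTER_THRESHOLD] for doc in docs]
```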
在预处理之后,可以确定分词的聚类层级权重,以衡量一个分词在某一类中的重要程度。例如,可以采用步骤S106~S110确定分词的聚类层级权重:
步骤S106,根据商品短文本的分词上下文信息获得商品短文本的文档向量。
其中,一种实现方法为,采用基于分词上下文语境的向量运算工具word2vec获得商品短文本的文档向量,从而将每条商品短文本表征为一个k维的向量,具体的维度可以设定。
在采用word2vec工具进行文档向量的计算时,还可以调整输入参数以便对结果进行优化。例如,可以根据数据集内商品短文本的平均长度确定窗口参数,将数据集作为输入语料、确定的窗口参数作为窗口大小进行计算得到数据集内的商品短文本的文档向量。合理地确定窗口大小参数可以提高文档向量计算的准确性和效率。例如,还可以对size参数进行优化。size参数用于表示向量的大小,当size参数值较大时,计算的精度较高,但是对计算设备的性能要求也相应提高。使用时,可以根据需要对size参数进行调整。
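下面给出步骤S106的一个示意性Python草图；其中采用gensim提供的Word2Vec训练词向量，并以词向量取平均作为文档向量，这种平均方式以及具体参数取值均为假设，仅用于说明。得到的文档向量矩阵可直接作为后续聚类的输入。

```python
# 示意性示例：用word2vec得到商品短文本的文档向量（步骤S106的一种可能实现）
import numpy as np
from gensim.models import Word2Vec

def doc_vectors(docs, k=100):
    # 根据数据集内短文本的平均分词数确定窗口参数
    window = max(2, round(sum(len(d) for d in docs) / len(docs)))
    # gensim 4.x 中向量维度参数为 vector_size（对应文中的size参数，旧版本名为size）
    model = Word2Vec(sentences=docs, vector_size=k, window=window,
                     min_count=1, sg=1, epochs=10)
    vectors = []
    for doc in docs:
        wv = [model.wv[w] for w in doc if w in model.wv]
        vectors.append(np.mean(wv, axis=0) if wv else np.zeros(k))
    return np.vstack(vectors)
```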
步骤S108,根据文档向量对数据集内的商品短文本进行聚类。
由于文档向量可以将商品短文本抽象为向量空间中的若干点,因此可以采用现有的基于距离的聚类算法进行聚类,例如K-MEANS算法、K-MEDOIDS算法、BIRCH算法等等。将文档向量输入特定的聚类算法进行聚类即可得到各个类。
在进行聚类时,还需要合理地确定聚类的数量,以使得具有共性的某类商品短文本能够尽可能划分到一类,尽量避免差异较大的商品短文本被划分为一类或者很相似的一些商品短文本却被聚到不同的类。
对于商品短文本,其聚类数量可以根据商品品类的数量确定,例如可以设置聚类数量大致等于商品品类的数量。品类是指电商网站中根据商品特性对商品进行分类后形成的类别名称,例如女装、箱包、日化、数码产品等,具体的品类数量、名称、细分程度等可以根据业务需求和产品情况进行设置。从而,使基于文档向量的聚类结果更准确,进而依据分词在其商品短文本所属聚类类别的权重可以更加准确地从商品短文本中提取出核心词。
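下面给出步骤S108的一个示意性Python草图，采用scikit-learn的K-MEANS实现，并直接令聚类数量等于商品品类数量；该设置仅为一种假设性的示例。

```python
# 示意性示例：按商品品类数量设定聚类数量，对文档向量进行K-MEANS聚类（步骤S108的一种可能实现）
from sklearn.cluster import KMeans

def cluster_docs(doc_vecs, n_categories):
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=0)
    labels = km.fit_predict(doc_vecs)   # labels[i] 为第i条商品短文本所属的类别编号
    return labels
```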
步骤S110,确定商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重。
其中,可以通过计算商品短文本中的各个分词与商品短文本所属类别的相关性来确定聚类层级权重。例如,采用卡方公式计算商品短文本中的各个分词在商品短文本所属的类别的卡方值后,可以将卡方值作为商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重。
例如,对于商品短文本“Apple iPhone 6s Plus(A1699)64G玫瑰金色移动联通电信4G手机”,设经过步骤S108的聚类操作后,该短文本属于“数码产品”类别。分词“手机”的卡方值例如可以参考表1进行计算。表1为卡方算法的独立样本四格表。如表1所示,设A为“手机”在“数码产品”类中出现的短文本数量,B为“手机”在“数码产品”以外的类中出现的短文本数量,C为“数码产品”类中不包含“手机”的短文本数量,D为“数码产品”以外的类中不包含“手机”的短文本数量,N为参与聚类的所有商品短文本的数量。
表1
|              | 属于“数码产品” | 不属于“数码产品” | 总计 |
| 包含“手机”   | A              | B                | A+B  |
| 不包含“手机” | C              | D                | C+D  |
| 总计         | A+C            | B+D              | N    |
对于短文本“Apple iPhone 6s Plus(A1699)64G玫瑰金色移动联通电信4G手机”中分词“手机”的卡方值,可以采用公式(1)进行计算:
K² = N×(A×D-B×C)² / [(A+B)×(C+D)×(A+C)×(B+D)]  (1)
卡方值K²即为该短文本中“手机”的聚类层级权重。
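按照表1与公式(1)，聚类层级权重的计算可以参考下面的示意性Python草图（函数名与数据组织方式均为假设）：

```python
# 示意性示例：按表1与公式(1)计算某分词在其短文本所属类别下的卡方值，作为聚类层级权重
def chi_square_weight(word, category, docs, labels):
    """docs为分词后的商品短文本列表，labels[i]为第i条短文本聚类后所属的类别。"""
    A = B = C = D = 0
    for doc, label in zip(docs, labels):
        contains, in_cat = word in doc, label == category
        if contains and in_cat:
            A += 1        # 类内且包含该分词的短文本数
        elif contains:
            B += 1        # 类外且包含该分词的短文本数
        elif in_cat:
            C += 1        # 类内且不包含该分词的短文本数
        else:
            D += 1        # 类外且不包含该分词的短文本数
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom
```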
然后,采用步骤S112确定核心词:
步骤S112,根据各个分词的聚类层级权重确定商品短文本的核心词。
例如,可以在确定商品短文本中各个分词的聚类层级权重后,将聚类层级权重的值最大的分词作为核心词。也可以根据需要,采用其他方法依据聚类层级权重的值确定核心词。
通过参考商品短文本中分词的上下文信息来获得该商品短文本的文档向量,可以弥补短文本信息量少的缺陷,使基于文档向量的聚类结果更准确,进而依据分词在其商品短文本所属聚类类别的权重可以更加准确地从商品短文本中提取出核心词。
在确定商品短文本的核心词时,除了参考分词的聚类层级权重以外,还可以结合其他权重对结果进行优化。例如,可以结合局部权重衡量分词对应的商品词性信息在单条商品短文本中的重要程度,或/和,还可以结合文档层级权重衡量分词在整个数据集内的重要程度。下面对聚类层级权重结合其他权重对结果进行优化的情形进行描述。
下面参考图2描述本发明另一个实施例的商品短文本核心词提取方法。
图2为本发明商品短文本核心词提取方法的另一个实施例的流程图。如图2所示,该实施例的方法包括:
首先,对数据集内的商品短文本进行预处理,得到各个商品短文本的分词。预处理的实现可以例如参考前文中步骤S102~S105的描述。
在预处理之后,除了确定分词的聚类层级权重(参考步骤S106~S110),还可以确定分词的局部权重,以衡量分词在单条商品短文本中的重要程度。例如,可以采用步骤S206~S210确定局部权重:
步骤S206,确定商品短文本中的各个分词的商品词性信息。
其中,在本发明实施例中,商品词性信息是指引入具有商品特点的词性信息,例如品牌、系列名称、品类、属性词、款式、修饰词、书名、作者等等。同一个商品短文本中的同一个分词可以具有多种商品词性信息。
在采用本实施例的方法之前,可以预先设置商品词性信息字典,字典包括若干词语以及其对应的商品词性信息。在确定商品短文本中的各个分词的商品词性信息时,可以逐一查找分词在商品词性信息字典中对应的商品词性信息并进行标注。
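下面给出步骤S206的一个示意性Python草图；其中的商品词性信息字典内容为假设，实际应由业务词表给出。

```python
# 示意性示例：依据预先构建的商品词性信息字典标注各分词的商品词性（步骤S206的一种可能实现）
# 说明：字典内容为假设；同一分词可对应多种商品词性信息
POS_DICT = {
    "Apple": ["品牌"],
    "iPhone": ["系列名称"],
    "手机": ["品类"],
    "64G": ["属性词"],
    "玫瑰金色": ["修饰词"],
}

def tag_product_pos(doc):
    """返回 {分词: [商品词性信息, ...]}，字典中未收录的分词记为空列表。"""
    return {w: POS_DICT.get(w, []) for w in doc}
```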
步骤S208,根据商品词性和词性权重的对应关系确定商品短文本中的分词对应的词性权重。
词性权重指某一类别的商品词性信息在商品短文本中的重要程度。例如,在商品标题中,品牌和品类为相对重要的词语,而“500ml”、“2米”等属性词为重要程度较低的词语。因此,可以根据商品词性信息的重要性和具体的业务需求设置商品词性信息对应的词性权重。上述过程可以人工设置,也可以采用基于统计的方法确定。
步骤S210,根据商品短文本中的各个分词对应的词性权重确定各个分词的局部权重。
当一个分词具有多种商品词性信息时,可以将这些商品词性信息对应的若干词性权重进行累加。此外,确定局部权重时,可以直接将分词对应的词性权重作为局部权重,也可以将商品短文本中所有的分词的词性权重进行归一化处理,将归一化处理后的结果作为各个分词的局部权重。设x为归一化处理之前某分词的局部权重,y为归一化处理之后该分词的局部权重,min和max分别为该分词所属的商品短文本中局部权重的最小值和最大值,则y的值可以通过公式(2)进行归一化计算:
y=(x-min)/(max-min)  (2)
通过归一化的处理,能够直观地反映各个分词的局部权重的统计分布特性。
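结合步骤S208~S210与公式(2)，局部权重的计算可以参考下面的示意性Python草图；其中的词性权重数值为假设值。

```python
# 示意性示例：由词性权重得到各分词的局部权重，并按公式(2)做归一化（步骤S208~S210的一种可能实现）
POS_WEIGHT = {"品牌": 0.30, "品类": 0.35, "系列名称": 0.15, "属性词": 0.05, "修饰词": 0.05}

def local_weights(doc, pos_of_word):
    # 一个分词具有多种商品词性信息时，将对应的词性权重累加
    raw = {w: sum(POS_WEIGHT.get(p, 0.0) for p in pos_of_word.get(w, ())) for w in doc}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:                       # 所有分词权重相同时避免除零
        return {w: 0.0 for w in raw}
    return {w: (x - lo) / (hi - lo) for w, x in raw.items()}   # y=(x-min)/(max-min)
```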
然后,采用步骤S212确定核心词:
步骤S212,根据各个分词的聚类层级权重和局部权重确定商品短文本的核心词。
在聚类层级权重的基础上,通过结合分词的局部权重,可以进一步针对商品短文本的特点,从其商品词性信息的角度对结果进行优化,提高了核心词确定的准确性。
本发明还提供了基于统计方式建立商品词性信息与词性权重的对应关系的方法。
一种示例性的方法，首先，获取训练语料，训练语料包括若干用于训练的商品短文本；然后，标注各个用于训练的商品短文本中的核心词和核心词对应的商品词性信息，例如，可以通过线下字典标注核心词及其商品词性，利用字典中记录的核心词及其对应的商品词性，对训练语料中的核心词进行商品词性的标注；最后，将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。
另一种示例性的方法，还可以根据用户的搜索行为自动标注核心词。首先，建立用户搜索时采用的搜索词和点击的商品短文本的对应关系；然后，将搜索词标注为对应的商品短文本的核心词，并标注核心词对应的商品词性信息；最后，将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。其中，建立用户搜索时采用的搜索词和点击的商品短文本的对应关系并标注核心词时，可以通过搜索点击日志建立搜索词和用户点击的商品短文本的对应关系并进行标注。例如，根据搜索点击日志，用户使用搜索词"手机"搜索商品，搜索结果为"Apple iPhone 6s 64G深空灰色电信4G手机"以及"小米Note白色移动4G手机"，而用户点击了前者，即"Apple iPhone 6s 64G深空灰色电信4G手机"，则将"手机"标注为"Apple iPhone 6s 64G深空灰色电信4G手机"的核心词。
通过上述基于统计的方法确定商品词性信息对应的词性权重,使词性权重的值更能适用于当前的使用环境,从而提高了核心词提取的准确性。
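上述基于统计确定词性权重的过程可以参考下面的示意性Python草图；其中labeled_corpus的数据组织方式为假设，既可来自人工标注，也可来自搜索点击日志。

```python
# 示意性示例：基于统计确定商品词性信息与词性权重的对应关系
# 说明：labeled_corpus 中每项为 (核心词, 该核心词的商品词性信息列表)
from collections import Counter

def pos_weight_from_corpus(labeled_corpus):
    counts, total = Counter(), 0
    for _core_word, pos_list in labeled_corpus:
        total += 1
        for pos in pos_list:
            counts[pos] += 1
    # 词性权重 = 具有该商品词性信息的核心词数量 / 训练语料中所有核心词的数量
    return {pos: c / total for pos, c in counts.items()}
```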
下面参考图3A描述本发明又一个实施例的商品短文本核心词提取方法。
图3A为本发明商品短文本核心词提取方法的又一个实施例的流程图。如图3A所示,该实施例的方法包括:
首先,对数据集内的商品短文本进行预处理,得到各个商品短文本的分词。预处理的实现可以例如参考前文中步骤S102~S105的描述。
在预处理之后,除了确定分词的聚类层级权重(参考步骤S106~S110),还可以确定分词的文档层级权重,以衡量分词在数据集的所有商品短文本中的重要程度。例如,可以采用步骤S306确定文档层级权重:
步骤S306,确定商品短文本中的各个分词在数据集内的逆向文件频率,将其确定为各个分词的文档层级权重。
逆向文件频率（IDF或idf，inverse document frequency）反映包含某一分词的文档在语料库中出现的普遍程度：包含该分词的文档越多，对应的idf值越小。即，如果一个分词在大量不同的文档中出现，说明该分词无法代表某一文档的特性。由于每条商品短文本相当于一个文档，因此商品短文本中的各个分词在数据集内的逆向文件频率idf值可以采用公式(3)进行计算：
idf = log(M / L)  (3)
其中，M表示商品短文本总数量，L表示包含某一分词的商品短文本数量，idf值即为该分词的文档层级权重。
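文档层级权重（公式(3)）的计算可以参考下面的示意性Python草图：

```python
# 示意性示例：计算各分词在数据集内的逆向文件频率，作为文档层级权重（公式(3)的一种实现）
import math
from collections import Counter

def idf_weights(docs):
    M = len(docs)                                        # 商品短文本总数量
    df = Counter(w for doc in docs for w in set(doc))    # L：包含某一分词的商品短文本数量
    return {w: math.log(M / L) for w, L in df.items()}
```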
然后,采用步骤S312a确定核心词:
步骤S312a,根据各个分词的聚类层级权重和文档层级权重确定商品短文本的核心词。
在聚类层级权重的基础上,通过结合分词的文档层级权重,能够进一步从数据集层面的重要性对核心词提取过程进行优化,从而提升了核心词确定的准确性。
显然，根据图1-3描述的核心词提取方法，还可以结合各个实施例分别得到聚类层级权重、局部权重和文档层级权重，然后将聚类层级权重、局部权重和文档层级权重共同作为核心词的确定依据，即，执行步骤S312b，根据各个分词的聚类层级权重、局部权重和文档层级权重确定商品短文本的核心词。上述商品短文本核心词提取方法例如可以参考图3B。
当根据一种以上的权重确定核心词时,可以首先将各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重加权求和作为各个分词的核心权重,再将商品短文本中核心权重的值最大的分词确定为商品短文本的核心词。下面参考图4描述本发明确定各个权重的加权系数的方法。
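将多种权重加权求和并取核心权重最大的分词，可以参考下面的示意性Python草图；其中加权系数默认全取1，仅作示意。

```python
# 示意性示例：将局部权重、文档层级权重与聚类层级权重加权求和得到核心权重，取最大者为核心词
def extract_core_word(doc, local_w, doc_w, cluster_w, coef=(1.0, 1.0, 1.0)):
    """coef为三种权重的加权系数，可由机器学习方法确定；此处默认全取1仅作示意。"""
    a, b, c = coef
    def core_weight(w):
        return (a * local_w.get(w, 0.0) + b * doc_w.get(w, 0.0)
                + c * cluster_w.get(w, 0.0))
    return max(doc, key=core_weight)
```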
图4为本发明确定各个权重的加权系数的方法的流程图。如图4所示,该实施例的方法包括:
步骤S402,对训练数据集内的各个商品短文本进行分词处理。
步骤S404,标注训练数据集内各个商品短文本的核心词和非核心词。
例如,核心词与非核心词的标注可以采用人工手动标注的方式;也可以根据搜索点击日志,将用户搜索时采用的搜索词作为用户点击的搜索结果对应的核心词。
步骤S406,计算训练数据集内各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重。
其中，具体的计算方法采用前述各个实施例中计算各个分词的局部权重、文档层级权重和聚类层级权重的方法。
步骤S408,将核心词作为正样本,非核心词作为负样本,根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重,采用机器学习的方法,例如线性回归、决策树、神经网络等算法,计算局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数。
通过采用上述方法,当根据多个权重确定商品短文本中的核心词时,能够调节各个权重之间的比重,提升核心词确定的准确性。
显然，根据需要，也可以将不同的权重直接求和作为核心权重，即各个权重的加权系数均为1，从而方便计算。
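确定加权系数的机器学习过程可以参考下面的示意性Python草图；此处以逻辑回归为例，这只是多种可选机器学习方法中的一种假设性选择。学习得到的系数可作为前述核心权重计算中各权重的加权系数。

```python
# 示意性示例：以核心词为正样本、非核心词为负样本，用逻辑回归学习各权重的加权系数
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_coefficients(samples):
    """samples中每项为([局部权重, 文档层级权重, 聚类层级权重], 是否为核心词)。"""
    X = np.array([feats for feats, _ in samples])
    y = np.array([1 if is_core else 0 for _, is_core in samples])
    clf = LogisticRegression().fit(X, y)
    return clf.coef_[0]   # 三个特征上的系数即可作为对应权重的加权系数
```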
下面参考图5描述本发明一个实施例的商品短文本核心词提取装置。
图5为本发明商品短文本核心词提取装置的一个实施例的结构图。如图5所示,该实施例的装置包括:预处理模块52、权重确定模块54和核心词确定模块56。预处理模块52包括:短文本获取单元522,用于获取数据集内的商品短文本;分词单元524,用于对商品短文本进行分词处理。权重确定模块54包括聚类层级权重确定子模块542,用于确定商品短文本中各个分词的权重。聚类层级权重确定子模块542包括:文档向量确定单元5422,用于根据商品短文本的分词上下文信息获得商品短文本的文档向量;聚类单元5424,用于根据文档向量对数据集内的商品短文本进行聚类;聚类层级权重确定单元5426,用于确定商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重。核心词确定模块56用于根据各个分词的聚类层级权重确定商品短文本的核心词。
通过采用文档向量确定单元参考商品短文本中分词的上下文信息来获得该商品短文本的文档向量，可以弥补短文本信息量少的缺陷，使聚类单元基于文档向量的聚类结果更准确，进而依据聚类层级权重确定单元得到的分词在其商品短文本所属聚类类别的权重，可以更加准确地从商品短文本中提取出核心词。
其中,商品短文本可以包括商品标题或商品评论。
其中,文档向量确定单元5422可以用于根据数据集内商品短文本的平均长度确定窗口参数,采用向量运算工具word2vec,将数据集作为输入语料、确定的窗口参数作为窗口大小进行计算,得到数据集内的商品短文本的文档向量。
其中,聚类单元5424聚类的数量可以根据商品品类的数量确定,使基于文档向量的聚类结果更准确,进而依据分词在其商品短文本所属聚类类别的权重可以更加准确地从商品短文本中提取出核心词。
其中,聚类层级权重确定单元5426可以用于采用卡方公式计算商品短文本中的各个分词在商品短文本所属的类别的卡方值,将卡方值作为商品短文本中的各个分词在商品短文本所属的类别的聚类层级权重。
下面参考图6描述本发明另一个实施例的商品短文本核心词提取装置。
图6为本发明商品短文本核心词提取装置的另一个实施例的结构图。如图6所示,该实施例的权重确定模块54还包括局部权重确定子模块644和/或文档层级权重确定子模块646。
其中,局部权重确定子模块644包括:商品词性信息确定单元6442,用于确定商品短文本中的各个分词的商品词性信息;词性权重确定单元6444,用于根据商品词性和词性权重的对应关系确定商品短文本中的分词对应的词性权重;局部权重确定单元6446,用于根据商品短文本中的各个分词对应的词性权重确定各个分词的局部权重。在设置聚类层级权重确定子模块的基础上,通过结合局部权重确定子模块,可以进一步针对商品短文本的特点,从其商品词性信息的角度对结果进行优化,提高了核心词确定的准确性。
其中,文档层级权重确定子模块646用于确定商品短文本中的各个分词在数据集内的逆向文件频率,将其确定为各个分词的文档层级权重;核心词确定模块56用于根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重确定商品短文本的核心词。在设置聚类层级权重确定子模块的基础上,通过结合文档层级权重确定子模块,能够进一步从数据集层面的重要性对核心词提取过程进行优化,从而提升了核心词确定的准确性。
其中,商品词性信息可以包括品牌、系列名称、品类、名词、属性词、款式、修饰词中的一个或多个。
其中,局部权重确定单元6446可以用于对商品短文本中的各个分词对应的词性权重进行归一化处理,获得各个分词的局部权重。通过归一化的处理,能够直观地反映各个分词的局部权重的统计分布特性。
其中，核心词确定模块56可以包括：核心权重计算单元662，用于将各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重加权求和作为各个分词的核心权重；核心词确定单元664，用于将商品短文本中核心权重的值最大的分词确定为商品短文本的核心词。
下面参考图7描述本发明又一个实施例的商品短文本核心词提取装置。
图7为本发明商品短文本核心词提取装置的又一个实施例的结构图。如图7所示，局部权重确定子模块644还可以包括第一词性权重对应单元7442或者第二词性权重对应单元7444。第一词性权重对应单元7442用于获取包括若干用于训练的商品短文本的训练语料，并标注各个用于训练的商品短文本中的核心词和核心词对应的商品词性信息，将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。第二词性权重对应单元7444用于建立用户搜索时采用的搜索词和点击的商品短文本的对应关系，将搜索词标注为对应的商品短文本的核心词，并标注核心词对应的商品词性信息，再将具有相同商品词性信息的核心词的数量与训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。其中，第二词性权重对应单元7444例如可以通过搜索点击日志建立搜索词和用户点击的商品短文本的对应关系。
通过采用上述基于统计的第一词性权重对应单元或者第二词性权重对应单元确定商品词性信息对应的词性权重,能够使词性权重的值更能适用于当前的使用环境,从而提高了核心词提取的准确性。
此外,核心词确定模块56还可以包括加权系数确定单元762,用于确定各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数并发送给核心权重计算单元662。加权系数确定单元762包括:训练数据分词子单元7622,用于对训练数据集内的各个商品短文本进行分词处理;训练数据标注子单元7624,用于标注训练数据集内各个商品短文本的核心词和非核心词;分词权重计算子单元7626,用于计算训练数据集内各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重;机器学习子单元7628,用于将核心词作为正样本,非核心词作为负样本,根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重,采用机器学习的方法计算局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数。
通过采用加权系数确定单元,当根据多个权重确定商品短文本中的核心词时,能够调节各个权重之间的比重,提升核心词确定的准确性。
此外,预处理模块52还可以包括数据清理单元722,用于去掉商品短文本中的停用词和标点符号,并将处理后的商品短文本发送给分词单元524。从而,可以减少对非必要词汇的处理,提升后续处理的效率。
此外,预处理模块52还可以包括分词过滤单元724,用于在对商品短文本进行分词处理之后,统计数据集内所有分词的出现频率,去掉商品短文本中出现频率低于过滤阈值的分词。从而,可以提升后续核心词的提取效率。
图8为本发明商品短文本核心词提取装置的再一个实施例的结构图。如图8所示,该实施例的装置800包括:存储器810以及耦接至该存储器810的处理器820,处理器820被配置为基于存储在存储器810中的指令,执行前述任意一个实施例中的商品短文本核心词提取方法。
其中,存储器810例如可以包括系统存储器、固定非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序(Boot Loader)以及其他程序等。
图9为本发明商品短文本核心词提取装置的再一个实施例的结构图。如图9所示,该实施例的装置800包括:存储器810以及处理器820,还可以包括输入输出接口930、网络接口940、存储接口950等。这些接口930,940,950以及存储器810和处理器820之间例如可以通过总线960连接。其中,输入输出接口930为显示器、鼠标、键盘、触摸屏等输入输出设备提供连接接口。网络接口940为各种联网设备提供连接接口。存储接口950为SD卡、U盘等外置存储设备提供连接接口。
本领域内的技术人员应当明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用非瞬时性存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解为可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
以上所述仅为本发明的较佳实施例，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。

Claims (25)

  1. 一种商品短文本核心词提取方法,其特征在于,包括:
    获取数据集内的商品短文本;
    对所述商品短文本进行分词处理;
    根据所述商品短文本的分词上下文信息获得所述商品短文本的文档向量;
    根据所述文档向量对所述数据集内的商品短文本进行聚类;
    确定所述商品短文本中的各个分词在所述商品短文本所属的类别的聚类层级权重;
    根据所述各个分词的聚类层级权重确定所述商品短文本的核心词。
  2. 根据权利要求1所述的方法,其特征在于,还包括:
    确定所述商品短文本中的各个分词的商品词性信息,根据商品词性和词性权重的对应关系确定所述商品短文本中的分词对应的词性权重,并根据所述商品短文本中的各个分词对应的词性权重确定所述各个分词的局部权重;
    和/或,
    确定所述商品短文本中的各个分词在所述数据集内的逆向文件频率,将其确定为所述各个分词的文档层级权重;
    所述根据所述各个分词的聚类层级权重确定所述商品短文本的核心词包括:
    根据所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重确定所述商品短文本的核心词。
  3. 根据权利要求1所述的方法,其特征在于,所述根据所述商品短文本的分词上下文信息获得所述商品短文本的文档向量包括:
    根据所述数据集内商品短文本的平均长度确定窗口参数;
    采用向量运算工具word2vec,将所述数据集作为输入语料、确定的窗口参数作为窗口大小进行计算,得到所述数据集内的商品短文本的文档向量。
  4. 根据权利要求1或2所述的方法,其特征在于,其中,所述聚类的数量根据商品品类的数量确定。
  5. 根据权利要求1或2所述的方法,其特征在于,所述确定所述商品短文本中的各个分词在所述商品短文本所属的类别的聚类层级权重包括:
    采用卡方公式计算所述商品短文本中的各个分词在所述商品短文本所属的类别的卡方值，将所述卡方值作为所述商品短文本中的各个分词在所述商品短文本所属的类别的聚类层级权重。
  6. 根据权利要求2所述的方法,其特征在于,采用以下方法确定所述商品词性及其词性权重的对应关系:
    获取训练语料,所述训练语料包括若干用于训练的商品短文本,并标注各个用于训练的商品短文本中的核心词和核心词对应的商品词性信息,将具有相同商品词性信息的核心词的数量与所述训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重;
    或者,
    建立用户搜索时采用的搜索词和点击的商品短文本的对应关系,将所述搜索词标注为对应的商品短文本的核心词,并标注所述核心词对应的商品词性信息,将具有相同商品词性信息的核心词的数量与所述训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。
  7. 根据权利要求2所述的方法,其特征在于,所述根据所述商品短文本中的各个分词对应的词性权重确定所述各个分词的局部权重包括:
    对所述商品短文本中的各个分词对应的词性权重进行归一化处理,获得所述各个分词的局部权重。
  8. 根据权利要求2所述的方法,其特征在于,所述根据所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重确定所述商品短文本的核心词包括:
    将所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重加权求和作为所述各个分词的核心权重;
    将所述商品短文本中核心权重的值最大的分词确定为所述商品短文本的核心词。
  9. 根据权利要求2所述的方法,其特征在于,采用以下方法确定局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数:
    对训练数据集内的各个商品短文本进行分词处理;
    标注所述训练数据集内各个商品短文本的核心词和非核心词;
    计算训练数据集内各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重;
    将核心词作为正样本，非核心词作为负样本，根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重，采用机器学习的方法计算局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数。
  10. 根据权利要求1所述的方法,其特征在于,还包括:在对所述商品短文本进行分词处理之前,去掉商品短文本中的停用词和标点符号。
  11. 根据权利要求1所述的方法,其特征在于,还包括:在对所述商品短文本进行分词处理之后,统计所述数据集内所有分词的出现频率,去掉商品短文本中出现频率低于过滤阈值的分词。
  12. 根据权利要求1所述的方法,其特征在于,其中,所述商品短文本包括商品标题、商品评论或商品信息页内容。
  13. 根据权利要求2所述的方法,其特征在于,其中,商品词性信息包括品牌、系列名称、品类、名词、属性词、款式、修饰词中的一个或多个。
  14. 一种商品短文本核心词提取装置,其特征在于,包括:
    预处理模块,所述预处理模块包括:
    短文本获取单元,用于获取数据集内的商品短文本;
    分词单元,用于对所述商品短文本进行分词处理;
    权重确定模块,包括聚类层级权重确定子模块,用于确定所述商品短文本中各个分词的权重;
    所述聚类层级权重确定子模块包括:
    文档向量确定单元,用于根据所述商品短文本的分词上下文信息获得所述商品短文本的文档向量;
    聚类单元,用于根据所述文档向量对所述数据集内的商品短文本进行聚类;
    聚类层级权重确定单元,用于确定所述商品短文本中的各个分词在所述商品短文本所属的类别的聚类层级权重;
    核心词确定模块,用于根据所述各个分词的聚类层级权重确定所述商品短文本的核心词。
  15. 根据权利要求14所述的装置,其特征在于,所述权重确定模块还包括局部权重确定子模块和/或文档层级权重确定子模块;
    其中,所述局部权重确定子模块包括:
    商品词性信息确定单元，用于确定所述商品短文本中的各个分词的商品词性信息；
    词性权重确定单元,用于根据商品词性和词性权重的对应关系确定所述商品短文本中的分词对应的词性权重;
    局部权重确定单元,用于根据所述商品短文本中的各个分词对应的词性权重确定所述各个分词的局部权重;
    其中,所述文档层级权重确定子模块用于确定所述商品短文本中的各个分词在所述数据集内的逆向文件频率,将其确定为所述各个分词的文档层级权重;
    所述核心词确定模块用于根据所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重确定所述商品短文本的核心词。
  16. 根据权利要求14所述的装置,其特征在于,所述文档向量确定单元用于根据所述数据集内商品短文本的平均长度确定窗口参数,采用向量运算工具word2vec,将所述数据集作为输入语料、确定的窗口参数作为窗口大小进行计算,得到所述数据集内的商品短文本的文档向量。
  17. 根据权利要求14或15所述的装置,其特征在于,其中,所述聚类单元聚类的数量根据商品品类的数量确定。
  18. 根据权利要求14或15所述的装置,其特征在于,所述聚类层级权重确定单元用于采用卡方公式计算所述商品短文本中的各个分词在所述商品短文本所属的类别的卡方值,将所述卡方值作为所述商品短文本中的各个分词在所述商品短文本所属的类别的聚类层级权重。
  19. 根据权利要求15所述的装置,其特征在于,所述局部权重确定子模块还包括第一词性权重对应单元或者第二词性权重对应单元;
    其中,所述第一词性权重对应单元用于获取包括若干用于训练的商品短文本的训练语料,并标注各个用于训练的商品短文本中的核心词和核心词对应的商品词性信息,将具有相同商品词性信息的核心词的数量与所述训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重;
    其中,所述第二词性权重对应单元用于建立用户搜索时采用的搜索词和点击的商品短文本的对应关系,将所述搜索词标注为对应的商品短文本的核心词,并标注所述核心词对应的商品词性信息,再将具有相同商品词性信息的核心词的数量与所述训练语料中所有核心词的数量的比值作为该商品词性信息对应的词性权重。
  20. 根据权利要求15所述的装置,其特征在于,所述局部权重确定单元用于对所述商品短文本中的各个分词对应的词性权重进行归一化处理,获得所述各个分词的局部权重。
  21. 根据权利要求15所述的装置,其特征在于,所述核心词确定模块包括:
    核心权重计算单元,用于将所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重加权求和作为所述各个分词的核心权重;
    核心词确定单元,用于将所述商品短文本中核心权重的值最大的分词确定为所述商品短文本的核心词。
  22. 根据权利要求15所述的装置,其特征在于,所述核心词确定模块还包括加权系数确定单元,用于确定所述各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数并发送给所述核心权重计算单元,所述加权系数确定单元包括:
    训练数据分词子单元,用于对训练数据集内的各个商品短文本进行分词处理;
    训练数据标注子单元,用于标注所述训练数据集内各个商品短文本的核心词和非核心词;
    分词权重计算子单元,用于计算训练数据集内各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重;
    机器学习子单元,用于将核心词作为正样本,非核心词作为负样本,根据各个分词的局部权重和文档层级权重中的至少一个以及聚类层级权重,采用机器学习的方法计算局部权重和文档层级权重中的至少一个以及聚类层级权重的加权系数。
  23. 根据权利要求14所述的装置,其特征在于,所述预处理模块还包括数据清理单元,和/或,分词过滤单元;
    其中,所述数据清理单元,用于去掉商品短文本中的停用词和标点符号,并将处理后的所述商品短文本发送给所述分词单元;
    其中,所述分词过滤单元,用于在对所述商品短文本进行分词处理之后,统计所述数据集内所有分词的出现频率,去掉商品短文本中出现频率低于过滤阈值的分词。
  24. 根据权利要求15所述的装置,其特征在于,其中,商品词性信息包括品牌、系列名称、品类、名词、属性词、款式、修饰词中的一个或多个;或者,所述商品短文本包括商品标题、商品评论或商品信息页内容。
  25. 一种商品短文本核心词提取装置,其特征在于,包括:
    存储器;以及
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令,执行如权利要求1-13中任一项所述的商品短文本核心词提取方法。
PCT/CN2017/072157 2016-03-30 2017-01-23 商品短文本核心词提取方法和装置 WO2017166912A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2017243270A AU2017243270B2 (en) 2016-03-30 2017-01-23 Method and device for extracting core words from commodity short text
US16/089,579 US11138250B2 (en) 2016-03-30 2017-01-23 Method and device for extracting core word of commodity short text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610191177.3 2016-03-30
CN201610191177.3A CN105808526B (zh) 2016-03-30 2016-03-30 商品短文本核心词提取方法和装置

Publications (1)

Publication Number Publication Date
WO2017166912A1 true WO2017166912A1 (zh) 2017-10-05

Family

ID=56454267

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072157 WO2017166912A1 (zh) 2016-03-30 2017-01-23 商品短文本核心词提取方法和装置

Country Status (4)

Country Link
US (1) US11138250B2 (zh)
CN (1) CN105808526B (zh)
AU (1) AU2017243270B2 (zh)
WO (1) WO2017166912A1 (zh)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808526B (zh) 2016-03-30 2019-07-30 北京京东尚科信息技术有限公司 商品短文本核心词提取方法和装置
CN107784019A (zh) * 2016-08-30 2018-03-09 苏宁云商集团股份有限公司 一种搜索业务中搜索词处理方法及系统
CN106598949B (zh) * 2016-12-22 2019-01-04 北京金山办公软件股份有限公司 一种词语对文本贡献度的确定方法及装置
CN106649276B (zh) * 2016-12-29 2019-02-26 北京京东尚科信息技术有限公司 标题中核心产品词的识别方法以及装置
CN106886934B (zh) * 2016-12-30 2018-01-23 北京三快在线科技有限公司 用于确定商家品类的方法、系统和装置
CN107562793A (zh) * 2017-08-01 2018-01-09 佛山市深研信息技术有限公司 一种大数据挖掘方法
CN107730346A (zh) * 2017-09-25 2018-02-23 北京京东尚科信息技术有限公司 物品聚类的方法和装置
WO2019084558A1 (en) * 2017-10-27 2019-05-02 Google Llc SELECTING ANSWER SPANS FROM ELECTRONIC DOCUMENTS USING MACHINE LEARNING
CN107862046B (zh) * 2017-11-07 2019-03-26 宁波爱信诺航天信息有限公司 一种基于短文本相似度的税务商品编码分类方法及系统
CN110196742A (zh) * 2018-02-27 2019-09-03 阿里巴巴集团控股有限公司 生成、展示数据对象信息的方法及装置
CN108899014B (zh) * 2018-05-31 2021-06-08 中国联合网络通信集团有限公司 语音交互设备唤醒词生成方法及装置
CN110633464A (zh) * 2018-06-22 2019-12-31 北京京东尚科信息技术有限公司 一种语义识别方法、装置、介质及电子设备
CN110889285B (zh) * 2018-08-16 2023-06-16 阿里巴巴集团控股有限公司 确定核心词的方法、装置、设备和介质
CN110968685B (zh) * 2018-09-26 2023-06-20 阿里巴巴集团控股有限公司 商品名称的归集方法和装置
CN109710762B (zh) * 2018-12-26 2023-08-01 南京云问网络技术有限公司 一种融合多种特征权重的短文本聚类方法
CN111078884B (zh) * 2019-12-13 2023-08-15 北京小米智能科技有限公司 一种关键词提取方法、装置及介质
US11900395B2 (en) * 2020-01-27 2024-02-13 Ncr Voyix Corporation Data-driven segmentation and clustering
CN112016298A (zh) * 2020-08-28 2020-12-01 中移(杭州)信息技术有限公司 产品特征信息的提取方法、电子设备及存储介质
CN113392651B (zh) * 2020-11-09 2024-05-14 腾讯科技(深圳)有限公司 训练词权重模型及提取核心词的方法、装置、设备和介质
CN112508612B (zh) * 2020-12-11 2024-02-27 北京搜狗科技发展有限公司 训练广告创意生成模型、生成广告创意的方法及相关装置
CN112685440B (zh) * 2020-12-31 2022-03-22 上海欣兆阳信息科技有限公司 标记搜索语义角色的结构化查询信息表达方法
CN112860898B (zh) * 2021-03-16 2022-05-27 哈尔滨工业大学(威海) 一种短文本框聚类方法、系统、设备及存储介质


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11120183A (ja) * 1997-10-08 1999-04-30 Ntt Data Corp キーワード抽出方法及び装置
US6574632B2 (en) * 1998-11-18 2003-06-03 Harris Corporation Multiple engine information retrieval and visualization system
US9275129B2 (en) * 2006-01-23 2016-03-01 Symantec Corporation Methods and systems to efficiently find similar and near-duplicate emails and files
CN101184259B (zh) * 2007-11-01 2010-06-23 浙江大学 垃圾短信中的关键词自动学习及更新方法
CN102063469B (zh) * 2010-12-03 2013-04-24 百度在线网络技术(北京)有限公司 一种用于获取相关关键词信息的方法、装置和计算机设备
US8719257B2 (en) * 2011-02-16 2014-05-06 Symantec Corporation Methods and systems for automatically generating semantic/concept searches
KR102131099B1 (ko) * 2014-02-13 2020-08-05 삼성전자 주식회사 지식 그래프에 기초한 사용자 인터페이스 요소의 동적 수정 방법
US9317498B2 (en) * 2014-05-23 2016-04-19 Codeq Llc Systems and methods for generating summaries of documents
CN104008186B (zh) * 2014-06-11 2018-10-16 北京京东尚科信息技术有限公司 从目标文本中确定关键词的方法和装置
CN104834747B (zh) 2015-05-25 2018-04-27 中国科学院自动化研究所 基于卷积神经网络的短文本分类方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136300A (zh) * 2011-12-05 2013-06-05 北京百度网讯科技有限公司 一种文本相关主题的推荐方法和装置
CN103186662A (zh) * 2012-12-28 2013-07-03 中联竞成(北京)科技有限公司 一种动态舆情关键词抽取系统和方法
CN103646074A (zh) * 2013-12-11 2014-03-19 北京奇虎科技有限公司 一种确定图片簇描述文本核心词的方法及装置
CN104866572A (zh) * 2015-05-22 2015-08-26 齐鲁工业大学 一种网络短文本聚类方法
CN105808526A (zh) * 2016-03-30 2016-07-27 北京京东尚科信息技术有限公司 商品短文本核心词提取方法和装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241275A (zh) * 2020-01-02 2020-06-05 厦门快商通科技股份有限公司 一种短文本相似度评估方法和装置以及设备

Also Published As

Publication number Publication date
US20200311113A1 (en) 2020-10-01
CN105808526B (zh) 2019-07-30
AU2017243270A1 (en) 2018-11-08
US11138250B2 (en) 2021-10-05
CN105808526A (zh) 2016-07-27
AU2017243270B2 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
WO2017166912A1 (zh) 商品短文本核心词提取方法和装置
WO2019200806A1 (zh) 文本分类模型的生成装置、方法及计算机可读存储介质
CN108959431B (zh) 标签自动生成方法、系统、计算机可读存储介质及设备
TWI518528B (zh) Method, apparatus and system for identifying target words
WO2019214245A1 (zh) 一种信息推送方法、装置、终端设备及存储介质
WO2019223103A1 (zh) 文本相似度的获取方法、装置、终端设备及介质
CN109165294B (zh) 一种基于贝叶斯分类的短文本分类方法
WO2017167067A1 (zh) 网页文本分类的方法和装置,网页文本识别的方法和装置
WO2022095374A1 (zh) 关键词抽取方法、装置、终端设备及存储介质
CN108255813B (zh) 一种基于词频-逆文档与crf的文本匹配方法
CN111104466A (zh) 一种海量数据库表快速分类的方法
CN108021651B (zh) 一种网络舆情风险评估方法及装置
CN111104526A (zh) 一种基于关键词语义的金融标签提取方法及系统
CN108027814B (zh) 停用词识别方法与装置
CN104361037B (zh) 微博分类方法及装置
CN108596637B (zh) 一种电商服务问题自动发现系统
CN108287911A (zh) 一种基于约束化远程监督的关系抽取方法
CN112069312B (zh) 一种基于实体识别的文本分类方法及电子装置
CN110990532A (zh) 一种处理文本的方法和装置
WO2018176913A1 (zh) 搜索方法、装置及非临时性计算机可读存储介质
CN108228612B (zh) 一种提取网络事件关键词以及情绪倾向的方法及装置
CN106681985A (zh) 基于主题自动匹配的多领域词典构建系统
CN112784063A (zh) 一种成语知识图谱构建方法及装置
TW202111569A (zh) 高擴展性、多標籤的文本分類方法和裝置
CN113806483A (zh) 数据处理方法、装置、电子设备及计算机程序产品

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017243270

Country of ref document: AU

Date of ref document: 20170123

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17772943

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 21.01.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17772943

Country of ref document: EP

Kind code of ref document: A1