WO2017157200A1 - Feature vocabulary extraction method and apparatus - Google Patents

Feature vocabulary extraction method and apparatus

Info

Publication number
WO2017157200A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
description information
spelling
vocabulary
network resource
Prior art date
Application number
PCT/CN2017/075831
Other languages
English (en)
French (fr)
Inventor
张增明
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present application relates to the field of data processing technologies, and in particular to a feature vocabulary extraction method and apparatus.
  • In addition to collecting structured product information such as category, title, and price, e-commerce websites also need to collect other words that reflect the characteristics of a product, such as style (sleeved, sleeveless, long-sleeved, v-neck, round neck, etc.), pattern (polka dot, houndstooth, etc.), and material (polyester, silk, etc.), in order to enrich the basic feature data of the product.
  • The common way for an e-commerce website to obtain the feature vocabulary of a product is to guide the merchant, when publishing the product, to fill in the words that best reflect its characteristics. Because not every merchant is willing and able to fill in feature vocabulary for every product, feature vocabulary obtained this way suffers from problems such as an insufficient number of phrases and poor phrase quality.
  • Aspects of the present application provide a feature vocabulary extraction method and apparatus to ensure a sufficient quantity of feature vocabulary and to improve its quality.
  • An aspect of the present application provides a feature vocabulary extraction method, including:
  • A feature vocabulary extraction apparatus, including:
  • An obtaining module configured to obtain description information of a network resource as an extracted corpus;
  • A marking module configured to perform phrase tagging on the extracted corpus to obtain a phrase tagging result;
  • An extracting module configured to extract, from the phrase tagging result, feature vocabulary that reflects the characteristics of the network resource.
  • In the present application, the description information of the network resource is used as the extracted corpus, phrase tagging is performed on the extracted corpus, and feature vocabulary that reflects the characteristics of the network resource is then extracted from the tagging result. Compared with the prior-art scheme of guiding the network resource provider to fill in feature vocabulary manually, the influence of the provider's subjective factors is eliminated: not only can a sufficient number of feature words be extracted, but their quality can also be guaranteed.
  • FIG. 1 is a schematic flowchart of a feature vocabulary extraction method according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart diagram of a feature vocabulary extraction method according to another embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart diagram of a feature vocabulary extraction method according to another embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a feature vocabulary extracting apparatus according to another embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a feature vocabulary extracting apparatus according to another embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart diagram of a feature vocabulary extraction method according to an embodiment of the present application. As shown in Figure 1, the method includes:
  • This embodiment provides a feature vocabulary extraction method, which can be executed by a feature vocabulary extraction device. The method extracts, from the description information of a network resource, feature vocabulary that reflects the characteristics of the network resource, thereby ensuring the quantity and quality of the extracted feature vocabulary.
  • The feature vocabulary of a network resource, which can also be called its highlight vocabulary, is the vocabulary that best describes the style, features, and other characteristics of the network resource, and can even be used as a label for it.
  • An item of feature vocabulary expresses a complete meaning and can be a word or a phrase. A phrase is more expressive than a word, so phrases are better suited as feature vocabulary, but feature vocabulary is not limited to phrases.
  • The content and related features of a network resource can be learned quickly from its feature vocabulary. For example, if the network resource is a piece of clothing, the phrase long-sleeve indicates that the garment has long sleeves and can serve as feature vocabulary for it.
  • the first step in extracting feature vocabulary is to prepare the corpus for extraction.
  • the description information of the network resource is obtained as the extracted corpus.
  • the description information of the network resource is mainly information related to the network resource, and may include, for example but not limited to, at least one of a title of the network resource, attribute information, keywords, detailed information, and comment information. It should be noted that the description information of the network resource is preferably text information, but is not limited thereto.
  • the attribute information of the network resource may be manually filled in by the network resource provider when publishing the network resource, for example, but not limited to: length, size, origin, style, jewelry, and the like.
  • the title and keyword of the above network resource are also manually filled in by the network resource provider when publishing the network resource.
  • the above network resources may be goods or services.
  • For a product, the title, attribute information, and keywords of the network resource are the title, attribute information, and keywords of the product.
  • attribute information for an item is as follows:
  • The amount of description information of network resources processed in this embodiment is very large and can even reach the order of 100 million items.
  • the original description information of the network resource may be obtained from the data warehouse, and the original description information is directly used as the extracted corpus.
  • the above step 101 includes:
  • The original description information of the network resource includes at least one of the title, attribute information, keywords, detailed information, and comment information of the network resource; the standardized description information of the network resource includes at least one of the title, attribute information, keywords, detailed information, and comment information of the network resource after text preprocessing.
  • text preprocessing of the original description information includes but is not limited to:
  • the original description information is subjected to at least one of a connection symbol reservation process, a case conversion process, a spelling consistency check process, a word segmentation process, a spelling error correction process, and a noun form restoration process.
  • For the hyphen "-", analysis shows that the symbol is generally added by the network resource provider when filling in the description information, usually to connect two or more related words so that together they express richer semantics. Take o-neck as an example: it is a correct spelling that means "round neck". If the hyphen is removed, it becomes oneck, which is a misspelling and may be corrected to neck in the subsequent spelling correction, so that it loses its original meaning.
  • The percent sign "%" may be used in some cases to indicate ingredient content.
  • The percent sign in "100% cotton" indicates that the cotton content is 100%, so it should be retained. Percent signs that do not need to be retained can be removed.
  • The percent sign in "v-neck%" does not need to be retained and is deleted.
  • The single quotation mark "'" may express possession in some cases.
  • The single quotation mark in girls' expresses possession and needs to be retained. Single quotation marks that do not need to be retained can be removed.
  • The single quotation mark in shoulder' is redundant and is deleted.
  • The formats that need to be retained can be specified in advance.
  • Hyphens, single quotation marks, percent signs, and other connection symbols in the original description information that do not conform to the specified formats are deleted.
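  • The connection symbol retention described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `retain_connection_symbols` and the three retained formats (a hyphen between letters, a percent sign after a digit, an apostrophe used as a possessive) are assumptions chosen to reproduce the examples in the text. Lowercasing stands in for the case conversion step.

```python
import re

def retain_connection_symbols(text):
    """Keep "-", "%" and "'" only in pre-specified formats (assumed here)."""
    text = text.lower()  # case conversion
    # Keep "-" only when it joins two letters (o-neck); drop it elsewhere.
    text = re.sub(r"(?<![a-z])-|-(?![a-z])", "", text)
    # Keep "%" only when it directly follows a digit (100%).
    text = re.sub(r"(?<!\d)%", "", text)
    # Keep "'" only as a possessive: after a final "s" (girls') or before "s" (women's).
    text = re.sub(r"(?<!s)'(?!s\b)", "", text)
    return text

print(retain_connection_symbols("O-Neck 100% cotton v-neck% girls' one shoulder'"))
```

  • A real system would load the retained formats from configuration rather than hard-coding them, since the text only says that they are specified in advance.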
  • The same word may be spelled differently in different places.
  • For example, the word dresses has the following spellings (a non-exhaustive list): dresses, dr-esses, dress-es, etc. Such inconsistent spelling makes later analysis harder and affects the quality of the extracted feature vocabulary. For this reason, spelling consistency check processing is performed on the original description information in advance to convert inconsistently spelled words into one consistent spelling.
  • Specifically, the number of times each spelling appears in the data warehouse is counted. Among the multiple spellings, the spelling that appears most often, and more often than a preset threshold, is selected as the target spelling, and the other spellings that appear in the original description information are replaced with the target spelling.
  • For example, suppose the word dresses appears in the original description information in three spellings: dresses, dr-esses, and dress-es. Counting shows that dresses appears most often in the data warehouse and more often than the preset threshold, so dresses is taken as the target spelling, and dr-esses and dress-es are replaced with dresses.
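  • The counting-and-replacing step above might look like the following sketch. The text does not say how spellings of the same word are grouped, so grouping by the hyphen-free form is an assumption, as are the names `build_spelling_map` and `min_count`.

```python
from collections import Counter, defaultdict

def build_spelling_map(tokens, min_count=2):
    """Map each minority spelling to the most frequent spelling of the same word."""
    counts = Counter(tokens)
    groups = defaultdict(list)
    for spelling in counts:
        # Assumption: spellings that are equal once hyphens are removed are the same word.
        groups[spelling.replace("-", "")].append(spelling)
    mapping = {}
    for variants in groups.values():
        if len(variants) < 2:
            continue  # only one spelling, nothing to unify
        target = max(variants, key=lambda s: counts[s])
        if counts[target] <= min_count:
            continue  # the best spelling is not frequent enough to be trusted
        for variant in variants:
            if variant != target:
                mapping[variant] = target
    return mapping

tokens = ["dresses"] * 5 + ["dr-esses", "dress-es", "o-neck"]
print(build_spelling_map(tokens))
```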
  • The result of the connection symbol retention processing and the case conversion processing, after further spelling consistency check processing, is:
  • A specific correction method is to perform word segmentation on the original description information, that is, to identify words that are written together in the original description information and to split them apart.
  • Results of the word segmentation process are as follows:
  • The left side of "->" is the run-together string, and the right side of "->" is the segmentation result.
  • The string to be processed is composed of words, and after word segmentation each word is separated.
  • By adopting an optimal segmentation strategy, the segmented words are more consistent with the semantics of the context.
  • The process of word segmentation is to eliminate interfering characters before and after the string, identify the words, and determine the optimal segmentation strategy in combination with the context so that the semantics are smoother.
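  • As an illustration of splitting run-together words, the sketch below picks the split whose words have the highest total log-frequency. The patent does not name a segmentation algorithm, so this dynamic-programming approach, the toy `WORD_COUNTS` table, and the function name `segment` are all assumptions.

```python
import math
from functools import lru_cache

# Toy word counts standing in for data-warehouse statistics (hypothetical values).
WORD_COUNTS = {"long": 50, "sleeve": 40, "dress": 60, "v": 5, "neck": 30}
TOTAL = sum(WORD_COUNTS.values())

def segment(text):
    """Split a run-together string into known words, maximizing total log-probability."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return 0.0, ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in WORD_COUNTS:
                score, rest = best(j)
                candidates.append((math.log(WORD_COUNTS[word] / TOTAL) + score, (word,) + rest))
        if not candidates:
            # No known word starts here: keep the remainder as one unknown token.
            return float("-inf"), (text[i:],)
        return max(candidates)
    return list(best(0)[1])

print(segment("longsleevedress"))
```

  • When several splits are possible, the score favors splits made of frequent words, which is one simple way to approximate the "optimal segmentation strategy" mentioned in the text; it does not use surrounding context.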
  • Spelling correction changes a wrong spelling into the correct form, for example correcting slit to sleeve. It is worth noting that the spelling correction here applies to any string.
  • The string here can be one word or multiple words. In this way, spelling correction can correct not only misspelled words but also misspelled phrases formed by multiple words.
  • Results of the spelling correction process are as follows:
  • The word segmentation process and the spelling correction process are described separately above. In practical applications, the two can also be combined. A string of run-together words may also contain a spelling mistake, such as "longsleve" in the title example above. Such a misspelled string cannot be split into words directly, so it must be corrected first, for example to "longsleeve". After the correction, it can be split into the correct form "long sleeve".
  • Combining word segmentation with spelling correction can solve many problems that a single technique cannot, and improves the effect of data preprocessing.
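  • The combination of correction and segmentation can be sketched as follows: if a string cannot be split into known words, try every string one edit away and keep the first one that can. The single-edit candidate generation, the toy `VOCAB`, and all function names are assumptions for illustration only.

```python
import string

VOCAB = {"long", "sleeve", "dress", "neck"}  # hypothetical vocabulary

def split_known(text):
    """Return a split of text into VOCAB words, or None if no such split exists."""
    if not text:
        return []
    for j in range(len(text), 0, -1):  # try longer first words first
        if text[:j] in VOCAB:
            rest = split_known(text[j:])
            if rest is not None:
                return [text[:j]] + rest
    return None

def edits1(text):
    """All strings one deletion, replacement, or insertion away from text."""
    letters = string.ascii_lowercase
    splits = [(text[:i], text[i:]) for i in range(len(text) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return sorted(set(deletes + replaces + inserts))

def correct_and_split(text):
    words = split_known(text)
    if words is not None:
        return words  # already splittable, no correction needed
    for candidate in edits1(text):
        words = split_known(candidate)
        if words is not None:
            return words  # corrected first, then split
    return [text]  # give up and keep the string unchanged

print(correct_and_split("longsleve"))
```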
  • Noun form restoration mainly refers to restoring the nouns in the original description information to their base form, that is, converting plural nouns to singular.
  • A gerund or the past participle of a verb may act as an adjective and may express a specific meaning, so form restoration of verbs and adjectives is not considered at present.
  • The nouns in the original description information may be restored according to at least one of a dictionary and preset singular-plural conversion rules.
  • The dictionary-based noun form restoration method is more brute-force, but more reliable.
  • The specific method is: obtain all nouns and their plural forms from the dictionary, construct the mapping between the nouns and their plural forms, then identify the plural nouns in the original description information based on the mapping and restore them to the singular form.
  • In the rule-based method, singular-plural conversion rules for nouns are set in advance. For example, a noun is generally made plural by appending "s", and a noun whose last character is "y" changes the "y" to "ies". Then, based on the conversion rules, the plural nouns in the original description information are identified, and the rules are applied in reverse to restore them to the singular form.
  • The noun form restoration process can preferentially use the dictionary. If a noun cannot be restored to the singular form with the dictionary, the restoration is then performed based on the singular-plural conversion rules. Generally speaking, the accuracy of the dictionary is relatively high and the coverage of the rules is relatively wide. Combining the two ensures the accuracy of noun form restoration while restoring as many plural nouns as possible to the singular form. Of course, either of the two methods can also be used alone.
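  • A minimal sketch of the dictionary-first, rules-second strategy follows. The mapping `PLURAL_TO_SINGULAR` is a hypothetical stand-in for a real dictionary, and the two reverse rules are only the ones named in the text; the function assumes its input has already been identified as a noun.

```python
# Hypothetical plural-to-singular mapping built from a dictionary.
PLURAL_TO_SINGULAR = {"dresses": "dress", "women": "woman", "scarves": "scarf"}

def singularize(noun):
    """Restore a plural noun to singular: dictionary first, then reverse rules."""
    if noun in PLURAL_TO_SINGULAR:  # accurate but limited coverage
        return PLURAL_TO_SINGULAR[noun]
    if noun.endswith("ies") and len(noun) > 4:  # reverse of "y" -> "ies"
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss") and len(noun) > 3:
        return noun[:-1]  # reverse of appending "s"
    return noun

for word in ["dresses", "women", "accessories", "sleeves", "dress"]:
    print(word, "->", singularize(word))
```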
  • After the text preprocessing described above, the original description information becomes normalized; the resulting description information is called standardized description information.
  • the standardized description information can be used as an extracted corpus to perform feature vocabulary extraction processing.
  • the foregoing implementation manner of performing phrase tagging on the extracted corpus includes:
  • The phrases to be marked are marked in the extracted corpus to obtain the phrase tagging result.
  • In this embodiment, phrase extraction is performed from two aspects.
  • On the one hand, a display phrase extraction process is performed on the extracted corpus to extract display phrases from the extracted corpus;
  • On the other hand, a mode (pattern) phrase extraction process is performed on the extracted corpus to extract implicit phrases from the extracted corpus.
  • A display phrase, that is, an explicit phrase, is a phrase that is easy to find;
  • An implicit phrase is a phrase that is not easy to find. Because this embodiment extracts both display phrases and implicit phrases, phrases can be extracted more comprehensively. In addition, this embodiment performs phrase extraction based on massive description information and does not depend on manual work, so errors caused by manual work are avoided and the quality of the phrases can be guaranteed.
  • the embodiment does not limit the execution order between the operation of extracting the displayed phrase and the operation of extracting the implicit phrase, and may be performed in any order or in parallel.
  • the above display phrase extraction process includes: a step of loading a preset display phrase rule and a step of extracting a display phrase from the extracted corpus according to the displayed phrase rule. Based on this, the above implementation manner of performing a display phrase extraction process on the extracted corpus to extract a display phrase from the extracted corpus includes:
  • an information segment that conforms to the displayed phrase rule is extracted as a display phrase.
  • The display phrase rule includes, but is not limited to, at least one of a specified string condition rule, a domain dictionary rule, and an attribute value rule.
  • The specified string condition rule indicates that a string that meets the specified string condition can be used as a display phrase.
  • The domain dictionary rule indicates that a phrase belonging to a domain dictionary can be used as a display phrase. The domain dictionary differs from field to field. For example, in the field of clothing, the English-Chinese Textile Dictionary can be used as a domain dictionary.
  • the above attribute value rule is used to indicate that the attribute value in the attribute information of the network resource can be used as a display phrase.
  • the information segment that meets the display phrase rule is extracted from the extracted corpus as the display phrase, and specifically includes at least one of the following operations:
  • the attribute value in the attribute information is extracted as a display phrase.
  • The extracted corpus of the network resource contains strings connected by the hyphen "-", for example package-hip-dress, v-neck, o-neck, wholesale-retail, one-shoulder, long-sleeve, etc.
  • A string connected by the hyphen "-" generally joins multiple words and can express a richer meaning, so a hyphen-connected string has a higher probability of being a phrase.
  • However, some hyphen-connected strings have no actual meaning and therefore cannot be used as phrases. For example, a-b and v-neck-half-sleeve-dress are not phrases.
  • String conditions include but are not limited to the following conditions:
  • The string is connected by the hyphen "-": this condition requires that a string be connected by the hyphen "-" in order to be a phrase, where a string connected by the hyphen "-" can be called a token;
  • The number of occurrences of the string is greater than a preset threshold: this condition requires that the string occur more often than the preset threshold, for example more than 500 times, where the number of occurrences refers to the frequency with which the string appears in the data warehouse;
  • The string is not an English word: this condition is used to exclude words, that is, words are not phrases;
  • The string does not contain conjunctions: this condition is mainly used to avoid conjunctions (such as and, but, or, for, so, nor, etc.) in the phrase;
  • The string does not contain stop words: this condition is mainly used to avoid stop words (such as of, a, etc.) in the phrase;
  • The string contains a specified number of words: this condition means that the string must contain the specified number of words in order to be a phrase, otherwise it cannot be a phrase;
  • The string does not contain numbers (except for percentages): this condition means that a string containing a number, other than a percentage, cannot be a phrase;
  • The length of each word in the string is less than a specified length (for example, less than 20 letters): this condition means that a string can be a phrase only if its words are shorter than the specified length;
  • The length of the string is greater than the number of words contained in the string: this condition means that a string can be a phrase only if its length is greater than the number of words it contains;
  • The string does not satisfy the specified regular rule: this condition means that a string that does not satisfy the specified regular rule can be a phrase. Conversely, a string that satisfies the regular rule cannot be a phrase.
  • The regular rules here include but are not limited to: "as-\w+", which matches a string starting with "as-", and "so-\w+", which matches a string starting with "so-".
  • dress-es: the last word ends with s or es;
  • full-sleevevneckdresssexyclubwear: the length of a word in the string exceeds the specified length;
  • a-b: the length of the string is not greater than the number of words contained in the string;
  • half-3sleeve: the string contains the number 3;
  • v-neck-half-sleeve-dress: the string contains too many words;
  • so-good: the string satisfies the specified regular rule.
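  • The string conditions above can be combined into one filter, sketched below. The word lists, the 2 to 3 word limit, and the function name `is_display_phrase` are assumptions; "length of the string" is interpreted as the number of letters excluding hyphens so that a-b is rejected as in the example, and the "last word ends with s or es" check is left out because its exact form is not stated.

```python
import re

CONJUNCTIONS = {"and", "but", "or", "for", "so", "nor", "yet"}
STOP_WORDS = {"of", "a", "an", "the"}
ENGLISH_WORDS = {"t-shirt"}  # hypothetical dictionary of hyphenated single words
EXCLUDE_RE = re.compile(r"(?:as|so)-\w+")  # the "regular rules" from the text

def is_display_phrase(token, count, min_count=500, min_words=2, max_words=3, max_word_len=20):
    words = token.split("-")
    if "-" not in token or any(not w for w in words):
        return False  # must be words joined by "-"
    if count <= min_count:
        return False  # too rare in the data warehouse
    if token in ENGLISH_WORDS:
        return False  # a single word, not a phrase
    if any(w in CONJUNCTIONS or w in STOP_WORDS for w in words):
        return False
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"\d", re.sub(r"\d+%", "", token)):
        return False  # digits are allowed only inside a percentage
    if any(len(w) >= max_word_len for w in words):
        return False
    if sum(len(w) for w in words) <= len(words):
        return False  # e.g. a-b: two letters for two words
    if EXCLUDE_RE.match(token):
        return False
    return True

for t in ["v-neck", "a-b", "half-3sleeve", "v-neck-half-sleeve-dress", "so-good"]:
    print(t, is_display_phrase(t, count=1000))
```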
  • the extracted corpus may include, but is not limited to, a title of a network resource, attribute information, and keywords.
  • The title, attribute information, and keywords of the network resource may be integrated into one information set, and strings that satisfy the specified string condition are then extracted from the information set as display phrases.
  • Alternatively, strings that satisfy the specified string condition may be extracted as display phrases separately from the title of the network resource, from the attribute information of the network resource, and from the keywords of the network resource, and so on.
  • the filtering rule can be pre-configured according to the specific application scenario, and used to filter out all the attributes that are useful for the phrase extraction, which are called key attributes. Then, using the key attributes as the corpus, the phrase extraction is performed.
  • Take the case in which the network resources are commodities as an example.
  • the user pre-configures the filtering rules and selects key attributes through the filtering rules.
  • the screening rules corresponding to different resource categories are different, and the key attributes selected are also different.
  • the key attributes selected according to the preset filtering rules include but are not limited to those shown in Table 1.
  • The domain dictionary stores the phrases of the domain. Therefore, it can be determined directly whether the extracted corpus contains a phrase from the domain dictionary; if so, that phrase can be directly determined to be a display phrase.
  • This method is relatively simple to implement and has high efficiency, and is especially suitable for finding relatively obvious phrases.
  • The scheme of extracting the attribute values in the attribute information as display phrases is described in detail below.
  • Attribute information generally includes an attribute name and an attribute value;
  • It is generally structured as attribute name: attribute value.
  • attribute values are mostly semantically explicit phrases, so attribute information can be found directly from the extracted corpus, and then the attribute values in the attribute information are extracted as display phrases.
  • the display phrase can be extracted in several ways. It should be noted that the above several methods of extracting display phrases may be used alone or in combination in any combination.
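  • Two of the ways of extracting display phrases described above, the domain dictionary lookup and the attribute value rule, can be combined as in the sketch below. The dictionary contents, the sample product, and the function name `display_phrases` are hypothetical.

```python
DOMAIN_DICTIONARY = {"polka dot", "houndstooth"}  # hypothetical domain dictionary

def display_phrases(title, attributes):
    """Collect display phrases from a title and from "name: value" attribute lines."""
    phrases = set()
    padded = f" {title.lower()} "
    for phrase in DOMAIN_DICTIONARY:  # domain dictionary rule
        if f" {phrase} " in padded:
            phrases.add(phrase)
    for line in attributes:  # attribute value rule
        name, sep, value = line.partition(":")
        if sep and value.strip():
            phrases.add(value.strip().lower())
    return phrases

print(display_phrases("Women houndstooth dress", ["sleeve length: long sleeve", "material: silk"]))
```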
  • The mode (pattern) phrase extraction process includes a step of loading a preset mode combination rule and a step of extracting implicit phrases from the extracted corpus according to the mode combination rule. Based on this, the above implementation of performing the mode phrase extraction process on the extracted corpus to extract implicit phrases from it includes:
  • Information fragments that conform to the mode combination rule are extracted from the extracted corpus as implicit phrases.
  • The mode combination rule includes, but is not limited to, at least one of a part-of-speech combination rule, a regular expression rule, and an attribute expression rule.
  • The part-of-speech combination rule indicates that a word combination that meets the specified part-of-speech combination condition can be used as an implicit phrase.
  • The regular expression rule indicates that a word combination that satisfies the specified regular expression can be used as an implicit phrase.
  • The attribute expression rule indicates that an implicit phrase is generated from the attribute information according to a preset generation rule.
  • Extracting, from the extracted corpus, information fragments that conform to the mode combination rule as implicit phrases may specifically include at least one of the following operations:
  • An implicit phrase is generated from the attribute information according to the preset generation rule.
  • Word combinations that follow certain part-of-speech combination patterns are often phrases. For example, word combinations that match adjective + noun ("^JJ\s+NNS{0,1}$") or adjective + adjective + noun ("^JJ\s+JJ\s+NNS{0,1}$") are generally phrases.
  • The part-of-speech combination condition may include: the adjective + noun pattern and the adjective + adjective + noun pattern.
  • Of course, there are other part-of-speech combination patterns.
  • For example, green flowers, natural-color, and hooded-collar are word combinations that follow the adjective + noun pattern and are phrases.
  • A word combination such as small green flowers, which follows the adjective + adjective + noun pattern, is also a phrase.
  • The window length may be set according to the number of words a phrase contains, and the extracted corpus is sampled sequentially with the set window length. It is then determined whether the sampled word combination meets the part-of-speech combination condition: if so, the word combination is determined to be an implicit phrase; if not, it is discarded and sampling continues with the next window.
  • two window lengths, 2 and 3, respectively, can be set for sampling word combinations of lengths 2 and 3.
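  • The window sampling with part-of-speech patterns can be sketched as follows. A real system would obtain tags from a part-of-speech tagger; the toy `POS` lookup table here is an assumption made so that the example is self-contained, and the tag names follow the JJ/NN/NNS convention used in the patterns above.

```python
import re

# Toy part-of-speech lookup standing in for a real tagger (hypothetical).
POS = {"small": "JJ", "green": "JJ", "flowers": "NNS", "dress": "NN", "with": "IN"}

# One pattern per window length: adjective + noun, adjective + adjective + noun.
PATTERNS = {
    2: re.compile(r"^JJ\s+NNS{0,1}$"),
    3: re.compile(r"^JJ\s+JJ\s+NNS{0,1}$"),
}

def implicit_phrases_by_pos(words):
    tags = [POS.get(w, "X") for w in words]  # "X" marks an unknown tag
    found = []
    for size, pattern in PATTERNS.items():
        for i in range(len(words) - size + 1):  # slide the window over the corpus
            if pattern.match(" ".join(tags[i:i + size])):
                found.append(" ".join(words[i:i + size]))
    return found

print(implicit_phrases_by_pos("dress with small green flowers".split()))
```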
  • "^[a-z]*?\s+style$" means xxx style, that is, a word combination of the form word + style, which may be a phrase that needs to be obtained, for example "sexy style" or "bohemia style";
  • "^[0-9]+%\s+[a-z]+$" means xx% xxx, that is, a word combination of the form percentage + word, which may be a phrase that needs to be obtained, for example "100% cotton";
  • "^%[0-9]+\s+[a-z]+$" means %xx xxx, that is, a word combination of the form percentage + word, which may be a phrase that needs to be obtained, for example "%100 cotton".
  • The extracted corpus is searched for the identifying part of a regular expression. After the identifying part is found, whether the words before or after it meet the requirement of the regular expression is determined according to the format of the regular expression; if so, the word combination formed by the identifying part and the words before or after it is obtained as an implicit phrase.
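  • Applying the regular expression rules above to adjacent word pairs might look like this. Testing every two-word window is a simplification of the "find the identifying part, then check its neighbors" procedure, and the function name `implicit_phrases_by_regex` is an assumption.

```python
import re

RULES = [
    re.compile(r"^[a-z]*?\s+style$"),    # word + style
    re.compile(r"^[0-9]+%\s+[a-z]+$"),   # percentage + word, e.g. 100% cotton
    re.compile(r"^%[0-9]+\s+[a-z]+$"),   # percentage written as %100 + word
]

def implicit_phrases_by_regex(words):
    found = []
    for i in range(len(words) - 1):
        candidate = f"{words[i]} {words[i + 1]}"
        if any(rule.match(candidate) for rule in RULES):
            found.append(candidate)
    return found

print(implicit_phrases_by_regex("sexy style dress 100% cotton".split()))
```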
  • The attribute information of the network resource includes an attribute name and an attribute value.
  • An implicit phrase may be generated from the attribute information according to a preset generation rule.
  • The generation rule indicates that the attribute name is converted into a display attribute name, and that the attribute value and the display attribute name are combined to generate an implicit phrase.
  • Generating an implicit phrase from the attribute information includes:
  • The display attribute name is generated from the attribute name in the attribute information, and the attribute value in the attribute information and the display attribute name are combined to generate an implicit phrase.
  • A conversion rule between the attribute name and the display attribute name may be preset, and the display attribute name is then generated based on the conversion rule.
  • the conversion rule can be adaptively set. Taking the clothing category in the e-commerce field as an example, an example of a conversion rule between an attribute name and a display attribute name is as follows:
  • each example consists of three parts, the attribute name, the slash, and the display attribute name.
  • the slash is used to split the attribute name and display attribute name.
  • the left side of the slash is the attribute name, and the right side of the slash is the display attribute name.
  • one way to generate an implicit phrase is: attribute value + display attribute name.
  • the attribute information may be obtained, and the attribute name in the attribute information is converted into the display attribute name according to the above conversion rule, and then the attribute value and the display attribute name are combined according to the above manner to form an implicit phrase.
  • the above display attribute name can be "NULL", that is, when an implicit phrase is generated, the display attribute name is empty, and the attribute name is not used.
  • some attribute values are of Boolean type. Take the attribute information "build-in-bra: yes" (typically found on goods in the wedding-dress category, indicating whether the dress has a built-in bra): if the attribute value is "yes", "y", or another value meaning "yes", the attribute value may be omitted when the implicit phrase is formed; otherwise it is not omitted.
  • for example, from the attribute information "build-in-bra: yes", the implicit phrase formed is "build-in-bra".
  • from the attribute information "build-in-bra: not", the implicit phrase "not-build-in-bra" is formed.
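The attribute-based generation rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `DISPLAY_NAMES` is a small assumed excerpt of the conversion rules (with `None` standing in for NULL and a hypothetical entry for build-in-bra), `YES_VALUES` is an assumed set of affirmative values, and joining words with hyphens simply mirrors the example phrases.

```python
# Assumed excerpt of the attribute-name -> display-attribute-name rules.
# None plays the role of "NULL" (the display attribute name is empty).
DISPLAY_NAMES = {
    "sleeve length": "sleeve",
    "sleeve style": "sleeve",
    "material": None,
    "build-in-bra": "build-in-bra",  # hypothetical entry for the Boolean example
}
YES_VALUES = {"yes", "y"}  # assumed set of values meaning "yes"


def implicit_phrase(attr_name, attr_value):
    """Combine an attribute value with its display attribute name."""
    name = attr_name.strip().lower()
    value = attr_value.strip().lower()
    if name not in DISPLAY_NAMES:
        return None  # no conversion rule for this attribute
    display = DISPLAY_NAMES[name]
    if value in YES_VALUES:
        # Boolean "yes": the attribute value is omitted
        parts = [display] if display else []
    else:
        # attribute value + display attribute name
        parts = [value] + ([display] if display else [])
    if not parts:
        return None
    # join all words with hyphens so the phrase forms a single token
    return "-".join(" ".join(parts).split())
```

For instance, `implicit_phrase("sleeve length", "half")` yields "half-sleeve", and a NULL display name leaves only the attribute value.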
  • after the explicit phrases and implicit phrases are extracted, the frequency of occurrence of each of these phrases can be counted, that is, how often each phrase occurs in the data warehouse, which serves as the corpus; these frequencies guide phrase tagging.
  • a method for specifically counting the frequency of occurrence of each phrase is: judging whether each phrase appears in the data warehouse, and if the judgment result is yes, the frequency of the corresponding phrase is increased by 1, otherwise the frequency of the corresponding phrase remains unchanged. This will give you the frequency of occurrence of each phrase.
  • the principle of phrase tagging is to tag the word combinations that are most likely to be phrases. The more frequently an identified phrase occurs, the more likely it is to be a genuine phrase. The phrases to be tagged in the extracted corpus can therefore be determined from the frequency of occurrence of each phrase, for example by taking the most frequent phrase among conflicting candidates as the phrase to be tagged; the phrases to be tagged are then tagged in the extracted corpus to obtain the phrase tagging result.
  • in the tagging result, the hyphen-connected word combinations are the tagged phrases.
  • phrase tagging plays a significant role: it achieves phrase-level segmentation of English text, producing text processing units of minimal semantic granularity. The feature vocabulary can then be extracted based on the phrase tagging result.
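  • one implementation of extracting the feature vocabulary from the phrase tagging result includes: extracting candidate vocabulary from the phrase tagging result; calculating the weight of the candidate vocabulary according to its frequency of occurrence; and selecting, according to the weights, the candidate vocabulary whose weight lies within a preset weight interval as the feature vocabulary of the network resource.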
  • the extracting the candidate vocabulary from the phrase tagging result includes: removing the useless words in the phrase tagging result to obtain the candidate vocabulary, that is, using the remaining words or phrases as the candidate vocabulary.
  • the above-mentioned useless words include, but are not limited to, stop words, quantifiers, direction words, and words that conform to regular expression rules.
  • Stop words: words that are useless in natural language processing. English has standard stop word lists, containing words such as i, me, my, the, and, but, of, and at.
  • in this embodiment, the standard stop words are adjusted: words that are useful within phrases, such as with, between, under, and over, are removed from the list, and new useless words or phrases specific to the domain of the network resources, such as wholesale, retail, shipping, free-shipping, fashion, price, offer, none, quantity, and shipment, are added, producing the stop word list.
  • the newly added words or phrases are those determined, after word segmentation of the data, to be useless for describing the features of the network resources and liable to cause interference. They can be found by manual analysis or by automated analysis.
  • each word or phrase in the phrase tagging result may be matched against the stop word list; if the same word or phrase is found in the list, it is determined to be a stop word and is removed from the phrase tagging result.
  • Quantifiers: in English, common quantifiers include one, two, three, width, thin, thick, and so on. A quantifier table can be built from them. In practice, each word or phrase in the phrase tagging result is matched against the quantifier table; if the same word or phrase is found in the table, it is determined to be a quantifier and is removed from the phrase tagging result.
  • Direction words: in English, common direction words include front, back, left, right, up, down, and so on. A direction word table can be built from them. In practice, each word or phrase in the phrase tagging result is matched against the direction word table; if the same word or phrase is found in the table, it is determined to be a direction word and is removed from the phrase tagging result.
  • Words that conform to regular expression rules: for example, "^[sxlm-]+$", "^\d+%$", and "^%\d+$".
  • the first removes size words such as xl and xxxl.
  • the second and third remove standalone percentages; a meaningful percentage will already have been tagged as part of a phrase.
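The removal of useless words can be sketched in Python as follows. The word tables are short assumed excerpts, and the three patterns are a reading of the regular expressions quoted above (`^[sxlm-]+$`, `^\d+%$`, `^%\d+$`); the function name `candidate_words` is hypothetical.

```python
import re

# Assumed excerpts of the word tables described above.
STOP_WORDS = {"and", "the", "wholesale", "retail", "free-shipping"}
QUANTIFIERS = {"one", "two", "three"}
DIRECTION_WORDS = {"front", "back", "left", "right", "up", "down"}
# Size words (xl, xxxl, ...) and standalone percentages (100%, %100).
USELESS_PATTERNS = [re.compile(p) for p in (r"^[sxlm-]+$", r"^\d+%$", r"^%\d+$")]


def candidate_words(tagged_tokens):
    """Drop useless words from a phrase tagging result; keep the rest."""
    kept = []
    for tok in tagged_tokens:
        if tok in STOP_WORDS or tok in QUANTIFIERS or tok in DIRECTION_WORDS:
            continue  # matched in one of the word tables
        if any(p.match(tok) for p in USELESS_PATTERNS):
            continue  # matched one of the regular expression rules
        kept.append(tok)
    return kept
```

A percentage that was tagged into a phrase, such as 100%-cotton, does not match the standalone-percentage patterns and therefore survives the filter.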
  • the candidate words that can be extracted from the phrase tag result are as follows:
  • the weight of each candidate vocabulary can be calculated.
  • the weight of the candidate vocabulary may be represented by a TF-IDF value.
  • the candidate vocabulary of each network resource forms a document, denoted D, and the documents formed from the candidate vocabulary of multiple network resources constitute a document collection;
  • the term frequency of a candidate word is the number of times it is repeated in the candidate vocabulary list of the current network resource; the frequency of a candidate word A of the network resource in document D is denoted TF_DA;
  • the inverse document frequency of candidate word A is computed from N_DOC, the total number of documents, and N_DOCIN, the number of documents in which candidate word A appears. It is worth noting that the inverse document frequency of a candidate word is unique;
  • from the term frequency and the inverse document frequency, the weight, that is, the TF-IDF value, is calculated.
  • words such as fashion and sexy have very low TF-IDF values, indicating that these words appear very rarely and cover an extremely small number of goods, so they are not useful for expressing network resource features. Other words have abnormally high TF-IDF values, which indicates that they may be erroneous; these are not useful for expressing network resource features either.
  • a weight interval range may be defined in advance, and a candidate vocabulary whose weight is located within a preset weight interval is selected as the feature vocabulary.
  • two variables, TF-IDF_HIGH and TF-IDF_LOW, can be defined to filter out the extremely high and extremely low TF-IDF values, respectively. After filtering, what remains is the candidate vocabulary whose TF-IDF value is within the preset TF-IDF interval, that is, the finally selected feature vocabulary.
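The weighting and interval filtering can be sketched in Python as follows. The text names the quantities TF_DA, N_DOC, and N_DOCIN but the formula itself is not reproduced here, so this sketch assumes the conventional form TF-IDF = TF_DA × log(N_DOC / N_DOCIN) with a natural logarithm; the function names and the bounds passed to `select_features` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: one list of candidate words per network resource."""
    n_doc = len(documents)                      # N_DOC
    doc_freq = Counter()                        # N_DOCIN per candidate word
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)                       # TF_DA per candidate word
        # assumed conventional IDF: log(N_DOC / N_DOCIN)
        weights.append({w: tf[w] * math.log(n_doc / doc_freq[w]) for w in tf})
    return weights


def select_features(doc_weights, low, high):
    """Keep candidate words whose weight lies in the interval [low, high]."""
    return {w: v for w, v in doc_weights.items() if low <= v <= high}
```

Under this assumed formula, a word that occurs in every document receives a weight of zero and is removed by any positive lower bound.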
  • the present application uses the description information of network resources as the extracted corpus and extracts the feature vocabulary from it after phrase tagging, thereby realizing automatic extraction of the feature vocabulary. No manual intervention is needed, so efficiency is high; and because no manual intervention is needed, massive amounts of data can be processed, which helps ensure the quantity and quality of the feature vocabulary. Further, during extraction, various kinds of text preprocessing are performed on the description information, and the feature vocabulary whose weight lies within the weight interval is selected from the candidate vocabulary, which further improves the quality of the feature vocabulary.
  • the feature vocabulary obtained above can be applied to the personalized recommendation system to improve recommendation efficiency and accuracy.
  • the feature vocabulary obtained above can also be used for the display page of the network resource, so that the user can understand the network resource more quickly and intuitively.
  • the specific manner of using the feature vocabulary in a personalized recommendation system includes: assigning corresponding labels to a user according to the network resources (such as goods) that the user has previously paid attention to (for example by browsing, adding to a shopping cart, or adding to favorites); when the user browses network resources (such as goods) again, the user's labels can be matched against the feature vocabulary of the network resources, and the associated network resources are recommended to the user.
  • feature vocabulary has certain divergence. Therefore, product recommendation based on feature vocabulary often allows users to discover more novel and desirable products.
  • applying the feature vocabulary often involves selecting or ordering it. The weight of each feature word is obtained together with the feature word and reflects its importance to some extent, so the feature vocabulary can be sorted by weight. However, the weight calculated earlier mainly considers the frequency of the feature word and ignores other factors. Based on this, as shown in FIG. 3, after extracting the feature vocabulary, the method further includes: correcting the weight of the feature vocabulary according to specified correction factors, and sorting the feature vocabulary according to the corrected weights.
  • the above specified factors include but are not limited to the following ones listed in Table 2:
  • the present application can also comprehensively consider other factors, such as the search popularity of a word, the trend of its search popularity, enthusiasm, and expressive power, and correct the weight of the feature vocabulary accordingly, so that the corrected weight reflects the importance of the feature vocabulary from more aspects and makes the feature vocabulary more convenient to apply.
  • FIG. 4 is a schematic structural diagram of a feature vocabulary extracting apparatus according to another embodiment of the present application. As shown in FIG. 4, the apparatus includes an obtaining module 41, a marking module 42, and an extraction module 43.
  • the obtaining module 41 is configured to obtain description information of the network resource as the extracted corpus.
  • the marking module 42 is configured to perform phrase tagging on the extracted corpus to obtain a phrase tag result.
  • the extracting module 43 is configured to extract, from the phrase tagging result, a feature vocabulary that can reflect a feature of the network resource.
  • the obtaining module 41 is specifically configured to:
  • Text preprocessing is performed on the original description information to obtain standardized description information as an extracted corpus.
  • the original description information of the foregoing network resource includes at least one of: a title of the network resource, attribute information, keywords, detailed information, and review information.
  • the obtaining module 41 performs text preprocessing on the original description information to obtain standardized description information as the extracted corpus, and is specifically used for:
  • the original description information is subjected to at least one of a connection symbol reservation process, a case conversion process, a spelling consistency check process, a word segmentation process, a spelling error correction process, and a noun form restoration process.
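  • when performing connection symbol reservation processing on the original description information, the obtaining module 41 is specifically configured to: determine whether the original description information contains hyphens, single quotation marks, or percent signs that conform to a specified format and, if so, retain them.
  • when performing spelling consistency check processing on the original description information, the obtaining module 41 is specifically configured to: for each word or phrase that appears in multiple spellings, count the occurrences of each spelling in the data warehouse, select the spelling that occurs most often and more often than a preset threshold as the target spelling, and replace the other spellings with the target spelling.
  • when performing noun form restoration processing on the original description information, the obtaining module 41 is specifically configured to: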
  • the morphological restoration of the nouns in the original description information is performed according to at least one of a dictionary and a preset singular and plural transformation rule.
  • marking module 42 is specifically configured to:
  • the tagged phrases are tagged in the extracted corpus to obtain the phrase tag results.
  • extraction module 43 is specifically configured to:
  • the candidate vocabulary whose weight is within the preset weight interval is selected as the feature vocabulary.
  • the extraction module 43 extracts the candidate vocabulary from the phrase tag result, it is specifically used to:
  • the apparatus further includes: a correction module 44 and a sorting module 45.
  • the correction module 44 is configured to correct the weight of the feature vocabulary according to the specified correction factor
  • the sorting module 45 is configured to sort the feature words according to the modified weights.
  • the feature vocabulary extracting apparatus uses the description information of the network resource as the extracted corpus, performs phrase tagging on the extracted corpus, and then extracts feature vocabulary that can reflect the features of the network resource. Compared with the prior-art scheme of guiding the network resource provider to fill in the feature vocabulary manually, this eliminates the influence of the provider's subjective factors, so that a sufficient number of feature words can be extracted and their quality can be guaranteed.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be other ways of dividing them. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
  • the software functional unit described above is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Abstract

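A feature vocabulary extraction method and apparatus. The method includes: obtaining description information of a network resource as an extracted corpus (101); performing phrase tagging on the extracted corpus to obtain a phrase tagging result (102); and extracting, from the phrase tagging result, feature vocabulary that can reflect the features of the network resource (103). The method ensures the quantity of the extracted feature vocabulary and improves its quality.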

Description

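Feature vocabulary extraction method and apparatus
This application claims priority to Chinese Patent Application No. 201610152669.1, filed on March 17, 2016 and entitled "Feature Vocabulary Extraction Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a feature vocabulary extraction method and apparatus.
Background
With the rapid development of the Internet, the amount of information online has increased sharply, and users need to search it for the information they want. For example, in the e-commerce field, users need to search for the goods they like among thousands upon thousands of goods.
In the prior art, in addition to collecting structured information about goods, such as category, title, and price, e-commerce websites also need to collect other words that reflect the characteristics of the goods, such as cut (half sleeve, sleeveless, long sleeve, v-neck, round neck, etc.), pattern (polka dot, houndstooth, etc.), and material (polyester, silk, etc.), to enrich the basic feature data of the goods. These words that reflect the characteristics of goods are called feature vocabulary.
At present, the common way for an e-commerce website to obtain the feature vocabulary of goods is to guide merchants, when they publish goods, to fill in the words that best reflect the characteristics of the goods. Because not every merchant is willing and able to fill in feature vocabulary for every item, this approach suffers from an insufficient number of phrases and poor phrase quality.
Summary
Aspects of the present application provide a feature vocabulary extraction method and apparatus, to ensure the quantity of the feature vocabulary and improve its quality.
One aspect of the present application provides a feature vocabulary extraction method, including:
obtaining description information of a network resource as an extracted corpus;
performing phrase tagging on the extracted corpus to obtain a phrase tagging result;
extracting, from the phrase tagging result, feature vocabulary that can reflect the features of the network resource.
Another aspect of the present application provides a feature vocabulary extraction apparatus, including:
an obtaining module, configured to obtain description information of a network resource as an extracted corpus;
a marking module, configured to perform phrase tagging on the extracted corpus to obtain a phrase tagging result;
an extraction module, configured to extract, from the phrase tagging result, feature vocabulary that can reflect the features of the network resource.
In the present application, the description information of the network resource is used as the extracted corpus, phrase tagging is performed on the extracted corpus, and feature vocabulary that can reflect the features of the network resource is then extracted from it. Compared with the prior-art scheme of guiding the network resource provider to fill in the feature vocabulary manually, this eliminates the influence of the provider's subjective factors, so that a sufficient number of feature words can be extracted and their quality can be guaranteed.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a feature vocabulary extraction method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a feature vocabulary extraction method according to another embodiment of the present application;
FIG. 3 is a schematic flowchart of a feature vocabulary extraction method according to yet another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a feature vocabulary extraction apparatus according to yet another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a feature vocabulary extraction apparatus according to yet another embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.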
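FIG. 1 is a schematic flowchart of a feature vocabulary extraction method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
101. Obtain description information of a network resource as an extracted corpus.
102. Perform phrase tagging on the extracted corpus to obtain a phrase tagging result.
103. Extract, from the phrase tagging result, feature vocabulary that can reflect the features of the network resource.
This embodiment provides a feature vocabulary extraction method that can be executed by a feature vocabulary extraction apparatus to extract, from the description information of a network resource, feature vocabulary that can reflect the features of the network resource, thereby ensuring the quantity and quality of the extracted feature vocabulary.
Before the flow of this embodiment is described in detail, the feature vocabulary in the present application is first explained:
The feature vocabulary of a network resource, which may also be called its highlight vocabulary, consists of the words that best describe the style, characteristics, and other features of the network resource, and may even serve as labels of the network resource. A feature word expresses a complete meaning and may be a single word or a phrase. Phrases are more expressive than single words and are therefore better suited to being feature vocabulary, although the feature vocabulary is not limited to phrases. Through the feature vocabulary of a network resource, a user can quickly learn its content and related features. For example, if the network resource is a garment, the phrase long-sleeve, which describes the garment as long-sleeved, can serve as a feature word of the garment.
The first step in extracting feature vocabulary is to prepare the extracted corpus. This embodiment obtains the description information of the network resource as the extracted corpus.
The description information of a network resource is mainly information related to the network resource and may include, but is not limited to, at least one of: the title of the network resource, attribute information, keywords, detailed information, and review information. It is worth noting that the description information is preferably text information, but is not limited thereto.
The attribute information of the network resource may be filled in manually by the network resource provider when publishing the network resource, and includes, but is not limited to: length, size, place of origin, style, decoration, and so on.
The title and keywords of the network resource are likewise filled in manually by the network resource provider when publishing the network resource.
In the e-commerce field, the network resource may be a product or a service. Taking a product as an example, the title, attribute information, and keywords of the network resource are in fact the title, attribute information, and keywords of the product.
An example of a product title is as follows: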
SEXY style girls’black dresses package-hip-dress one shouder’longSLEEVE++Green flowers+v-neck%oneck COCKAIL DR-ESS+wholesale and retail+free shipping 100%cotton
An example of the attribute information of a product is as follows:
Length:floor-length|Decoration:beading|Gender:woman|Season:summer|Pattern Type:print|Sleeve Style:off the shoulder|Neckline:o-neck|Style:casual|Place of Origin:Fujian,China Mainland|Number:LC2132-1LC2132-2LC2132-3
An example of the keywords of a product (comma-separated) is as follows:
Blue Dress Party,Fashion Ladies Blue Dress Party,Fashion Ladies Blue Dress Party
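It is worth noting that, because this is big-data processing, the amount of description information extracted in this embodiment is very large and may even reach the hundreds of millions.
In an optional implementation, the original description information of the network resource may be obtained from a data warehouse and used directly as the extracted corpus.
In another optional implementation, it is taken into account that the original description information contains irregularities and errors, for example words joined by odd symbols, several words written together so that they cannot be told apart, misspelled words, and the same word or phrase written inconsistently in different places. Using the original description information directly would complicate later processing and lower the quality of the extracted feature vocabulary. Based on this, as shown in FIG. 2, step 101 includes:
101a. Obtain the original description information of the network resource from the data warehouse.
101b. Perform text preprocessing on the original description information to obtain standardized description information as the extracted corpus.
Put simply, after the original description information is obtained, text preprocessing is performed on it to clean the data and obtain clean, usable data. The other steps in FIG. 2 are the same as those shown in FIG. 1 and are described together in detail below.
It is worth noting that the original description information of the network resource includes at least one of: the title of the network resource, attribute information, keywords, detailed information, and review information; the standardized description information of the network resource includes at least one of the same items after text processing.
Further, text preprocessing of the original description information includes, but is not limited to:
performing on the original description information at least one of connection symbol reservation processing, case conversion processing, spelling consistency check processing, word segmentation processing, spelling error correction processing, and noun form restoration processing.
Connection symbol reservation processing:
The original description information may contain odd connection symbols, such as the plus sign "+". Removing these symbols makes the original description information more regular and easier to process. However, some special connection symbols may express a special or richer meaning.
Take the hyphen "-". Analysis shows that this symbol is generally added deliberately by the network resource provider when filling in the description information, and usually joins two or more related words that the provider wants to connect in order to express a richer meaning. For example, o-neck is a correct spelling meaning "round neck"; if the hyphen is removed it becomes oneck, a misspelling that may later be corrected to neck during error correction, losing its original meaning.
Take the percent sign "%". In some cases it expresses content by composition; for example, the percent sign in "100%cotton" indicates one hundred percent cotton and should be retained. Where it does not need to be retained it can be removed; for example, the percent sign in "v-neck%" is not needed and is deleted.
Take the single quotation mark "’". In some cases it expresses possession; for example, the single quotation mark in girls’ expresses possession and needs to be retained. Where it does not need to be retained it can be removed; for example, the single quotation mark in shoulder’ is superfluous and is deleted.
Based on the above, for hyphens, single quotation marks, and percent signs, the formats that need to be retained can be specified in advance. When connection symbol reservation processing is performed, it is determined whether the original description information contains hyphens, single quotation marks, or percent signs that conform to a specified format; if so, they are retained. Hyphens, single quotation marks, and percent signs that do not conform to a specified format, as well as all other connection symbols, are deleted.
Taking the above product title as an example:
Before connection symbol reservation processing: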
SEXY style girls’black dresses package-hip-dress one shouder’longSLEEVE++Green flowers+v-neck%oneck COCKAIL DR-ESS+wholesale and retail+free shipping 100%cotton
After connection symbol reservation processing:
SEXY style girls’black dresses package-hip-dress one shouder longSLEEVE Green flowers v-neck oneck COCKAIL DR-ESS wholesale and retail free shipping 100%cotton
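It is worth noting that, when a percent sign is to be retained and there is no space between it and the following word, a space can be added; for example, "100%cotton" above becomes "100% cotton", which makes the preprocessed information more regular.
Case conversion processing:
This step unifies letter case. Depending on the application requirements, upper case can be converted uniformly to lower case, or lower case uniformly to upper case.
Taking the above product title as an example, the title after connection symbol reservation processing and before case conversion processing is as follows: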
SEXY style girls’black dresses package-hip-dress one shouder longSLEEVE Green flowers v-neck oneck COCKAIL DR-ESS wholesale and retail free shipping 100%cotton
The title after connection symbol reservation processing and case conversion processing is as follows:
sexy style girls’black dresses package-hip-dress one shouder longsleeve green flowers v-neck oneck cockail dr-ess wholesale-retail free shipping 100%cotton
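Spelling consistency check processing:
Analysis shows that the same word may be spelled differently in different places. For example, the word dresses occurs in the following spellings (not an exhaustive count): dresses, dr-esses, dress-es, and so on. Such inconsistency complicates later analysis and affects the quality of the extracted feature vocabulary. Based on this, spelling consistency check processing is performed on the original description information in advance, converting inconsistently spelled words to a consistent spelling.
Specifically, for each word or phrase in the original description information that appears in multiple spellings, the number of occurrences of each spelling in the data warehouse is counted. From these counts, the spelling that occurs most often and more often than a preset threshold is selected as the target spelling, and the other spellings of the word or phrase in the original description information are replaced with the target spelling.
For example, suppose the word dresses appears in the original description information in three spellings: dresses, dr-esses, and dress-es. If counting shows that dresses occurs most often in the data warehouse and more often than the preset threshold, dresses is taken as the target spelling, and dr-esses and dress-es are replaced with dresses.
Continuing with the above product title, the title after connection symbol reservation processing and case conversion processing is further converted by spelling consistency check processing into: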
sexy style girls’black dresses package-hip-dress one shouder longsleeve green flowers v-neck o-neck cockail dress wholesale-retail free shipping 100%cotton
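The selection of a target spelling can be sketched in Python as follows. The text does not say how variant spellings of the same word are recognized; this sketch assumes, purely for illustration, that two spellings are variants when they are identical after hyphens are removed. The names `canonical_key` and `choose_target_spellings` are hypothetical.

```python
from collections import Counter


def canonical_key(token):
    # Assumption: spellings that differ only in hyphens are variants of one word.
    return token.replace("-", "")


def choose_target_spellings(warehouse_tokens, min_count):
    """Map each non-target spelling to the target spelling of its group."""
    counts = Counter(warehouse_tokens)          # occurrences in the data warehouse
    groups = {}
    for spelling, n in counts.items():
        groups.setdefault(canonical_key(spelling), []).append((n, spelling))
    replacements = {}
    for variants in groups.values():
        if len(variants) < 2:
            continue                            # only one spelling, nothing to unify
        best_count, best = max(variants)        # most frequent spelling
        if best_count > min_count:              # must also exceed the threshold
            for _, spelling in variants:
                if spelling != best:
                    replacements[spelling] = best
    return replacements
```

The returned mapping can then be applied token by token to the original description information.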
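Word segmentation processing:
Analysis shows that the original description information often contains several words written together, such as "longsleeve" in the title above, as well as misspelled words, such as "shouder" (which should be shoulder) and "cockail" (which should be cocktail) in the title above. These errors seriously affect later processing and need to be corrected.
One specific way to address this is to perform word segmentation processing on the original description information, that is, to identify words that have been written together and split them apart.
Examples of word segmentation results are as follows: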
longsleevefloorlengthdress->long sleeve floor length dress
dgdhlongsleevekl->dgdh long sleeve kl
swearskirt->swear skirt
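In the above examples, the left side of "->" is the run of words written together and the right side is the segmentation result. In the first example, the string consists entirely of words, and each word is split out. In the second example, there are noise characters before and after the words; segmentation not only splits out the recognized words (long sleeve) but also isolates the noise characters. In the third example, an optimal segmentation strategy is used so that the resulting words better fit the semantics of the context.
In summary, word segmentation removes leading and trailing noise characters as far as possible, identifies the words, and determines the optimal segmentation strategy from the context so that the result reads more naturally.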
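One way to realize such segmentation is dynamic programming over a vocabulary, sketched below in Python. This is a stand-in technique, not the method of the application: it assumes every vocabulary word costs 1 and every unrecognized character costs `unknown_penalty`, so the cheapest split favors few, known words, and it uses no context beyond that. The function name and the vocabulary are hypothetical.

```python
def segment(text, vocab, unknown_penalty=3):
    """Split run-together text into vocabulary words and noise chunks."""
    n = len(text)
    max_len = max(map(len, vocab), default=0)
    best = [0] + [float("inf")] * n     # best[i]: minimal cost of text[:i]
    back = [None] * (n + 1)             # back[i]: (start, is_word) of last piece
    for i in range(1, n + 1):
        # option 1: treat text[i-1] as a single noise character
        best[i], back[i] = best[i - 1] + unknown_penalty, (i - 1, False)
        # option 2: end a vocabulary word at position i
        for j in range(max(0, i - max_len), i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, (j, True)
    pieces, i = [], n
    while i > 0:                        # walk the back-pointers from the end
        j, is_word = back[i]
        pieces.append((text[j:i], is_word))
        i = j
    pieces.reverse()
    out, prev_noise = [], False
    for piece, is_word in pieces:       # merge adjacent noise characters
        if not is_word and prev_noise:
            out[-1] += piece
        else:
            out.append(piece)
        prev_noise = not is_word
    return out
```

With a vocabulary containing long, sleeve, swear, skirt, and wear, the string dgdhlongsleevekl splits into dgdh, long, sleeve, kl, and swearskirt splits into swear and skirt rather than s, wear, skirt.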
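Spelling error correction processing:
Spelling error correction converts a misspelled form into the correct form, for example sleve into sleeve. It is worth noting that the correction here applies to an arbitrary string (token), which may be a single word or several words; it can therefore correct not only misspelled words but also misspelled phrases formed from several words.
Examples of spelling error correction results are as follows: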
sleve->sleeve
dres->dress
wholesle->wholesale
shouder->shoulder
saikaaadffdsaf->saikaaadffdsaf
sleevc->sleeve
sleever->sleeve
sleeev->sleeve
sleeevt->sleeve
longsleve->longsleeve
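In the above examples, the left side of "->" is the misspelled form and the right side is the corrected form.
Word segmentation and spelling error correction are described separately above, but in practice they can be combined. Words that are written together may also be misspelled, for example "longsleve". Such a misspelled form cannot be segmented directly and must first be corrected, for example to "longsleeve". After correction it can be segmented into the correct form "long sleeve". Combining word segmentation with spelling error correction solves many problems that neither technique can solve alone and improves the effect of data preprocessing.
Continuing with the above product title, the title after connection symbol reservation processing, case conversion processing, and spelling consistency check processing is further converted by word segmentation and spelling error correction into: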
sexy style girls’black dresses package-hip-dress one shoulder long sleeve green flowers v-neck o-neck cocktail dress wholesale-retail free shipping 100%cotton
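Noun form restoration:
Noun form restoration mainly refers to lemmatizing the nouns in the original description information, that is, converting plural nouns to the singular.
It is worth noting that, because gerunds and past-tense verb forms may function as adjectives and may express a specific meaning, this embodiment does not lemmatize verbs or adjectives for the time being.
In this embodiment, the nouns in the original description information can be lemmatized according to at least one of a dictionary and preset singular-plural transformation rules.
The dictionary-based approach is a brute-force one, but it is reliable. Specifically, all nouns and their plural forms are obtained from a dictionary and a mapping between each noun and its plural form is built; this mapping is then used to identify plural nouns in the original description information and restore them to the singular.
In the rule-based approach, singular-plural transformation rules are preset; for example, nouns generally form the plural by appending "s", or by changing a final "y" to "ies". These rules are then used to identify plural nouns in the original description information, and the rules are applied in reverse to restore the identified plurals to the singular.
In practice, dictionary-based restoration can be tried first; if the dictionary cannot restore a noun to the singular, rule-based restoration is applied. In general the dictionary is more accurate while the rules have broader coverage, so combining them ensures both the accuracy of noun form restoration and that as many plural nouns as possible are restored. Of course, either approach can also be used on its own.
Continuing with the above product title, the title after connection symbol reservation processing, case conversion processing, spelling consistency check processing, word segmentation, and spelling error correction is further converted by noun form restoration into: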
sexy style girls’black dress package-hip-dress one shoulder long sleeve green flower v-neck o-neck cocktail dress wholesale-retail free shipping 100%cotton
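The dictionary-first, rules-second strategy can be sketched in Python as follows. The `PLURAL_TO_SINGULAR` mapping is a tiny assumed excerpt of a dictionary, the two reverse rules follow the examples in the text ("s" and "ies"), and the length guards are assumptions added to avoid mangling very short words. The caller is assumed to pass nouns only.

```python
# Assumed excerpt of a dictionary-derived plural -> singular mapping.
PLURAL_TO_SINGULAR = {"dresses": "dress", "women": "woman"}


def singularize(noun):
    """Restore a plural noun to the singular: dictionary first, then rules."""
    if noun in PLURAL_TO_SINGULAR:
        return PLURAL_TO_SINGULAR[noun]         # reliable dictionary lookup
    if noun.endswith("ies") and len(noun) > 4:
        return noun[:-3] + "y"                  # reverse of y -> ies
    if noun.endswith("s") and not noun.endswith("ss") and len(noun) > 3:
        return noun[:-1]                        # reverse of appending s
    return noun                                 # already singular or unknown
```

The dictionary entry handles dresses, which the plain "s" rule would wrongly turn into dresse.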
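It is worth noting that each text preprocessing method is described separately above. In practice, the methods can be used alone or in combination.
After the above text preprocessing, the original description information becomes normalized; for ease of distinction, the resulting description information is called standardized description information.
The standardized description information can then be used as the extracted corpus for feature vocabulary extraction.
Specifically, a single word is often incomplete in meaning; for example, "sleeve" alone says little, whereas "long sleeve" is a feature word meaning "long-sleeved". That is, phrases are more likely to be feature vocabulary. Therefore, to extract more high-quality feature vocabulary, after the extracted corpus is obtained, the phrases in it are first tagged, and the feature vocabulary is then extracted based on the tagging result.
In an optional implementation, phrase tagging of the extracted corpus includes:
performing an explicit phrase extraction process and a pattern phrase extraction process on the extracted corpus, to extract explicit phrases and implicit phrases from it;
determining the phrases to be tagged in the extracted corpus according to the frequencies of occurrence of the explicit phrases and implicit phrases;
tagging the phrases to be tagged in the extracted corpus to obtain the phrase tagging result.
In this embodiment, to extract phrases from the extracted corpus more comprehensively, phrases are extracted in two ways: the explicit phrase extraction process extracts explicit phrases from the extracted corpus, and the pattern phrase extraction process extracts implicit phrases from it.
Explicit phrases are phrases that are easy to find, and implicit phrases are phrases that are not easy to find. Because this embodiment extracts both, it extracts phrases more comprehensively. In addition, phrase extraction here is based on massive description information and does not rely on manual work, which avoids human error and safeguards phrase quality.
It is worth noting that this embodiment does not restrict the order in which explicit phrases and implicit phrases are extracted; the two operations can be performed in any order or in parallel.
Further, the explicit phrase extraction process includes a step of loading preset explicit phrase rules and a step of extracting explicit phrases from the extracted corpus according to those rules. Accordingly, performing the explicit phrase extraction process on the extracted corpus includes:
loading the preset explicit phrase rules;
extracting, from the extracted corpus, information fragments that conform to the explicit phrase rules as explicit phrases.
In an optional implementation, the explicit phrase rules include, but are not limited to, at least one of: a specified string condition rule, a domain dictionary rule, and an attribute value rule.
The specified string condition rule indicates that a string satisfying the specified string conditions can serve as an explicit phrase.
The domain dictionary rule indicates that a phrase found in a domain dictionary can serve as an explicit phrase. Domain dictionaries differ by domain; in the clothing domain, for example, the 《英汉纺织大辞典》 (an English-Chinese textile dictionary) can count as a domain dictionary.
The attribute value rule indicates that an attribute value in the attribute information of the network resource can serve as an explicit phrase.
Based on these explicit phrase rules, extracting information fragments that conform to the rules as explicit phrases may include at least one of the following operations:
extracting, from the extracted corpus, strings that satisfy the specified string conditions as explicit phrases;
extracting, from the extracted corpus, phrases found in the domain dictionary as explicit phrases;
when the extracted corpus includes the attribute information of the network resource, extracting the attribute values in the attribute information as explicit phrases.
The scheme of extracting strings that satisfy the specified string conditions as explicit phrases is described in detail below.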
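Specifically, the extracted corpus of network resources contains strings connected by the hyphen "-"; for example, package-hip-dress, v-neck, o-neck, wholesale-retail, one-shoulder, and long-sleeve are all hyphen-connected strings. A hyphen-connected string generally joins several words to express a richer meaning, so it is likely to be a phrase. Some hyphen-connected strings, however, have no real meaning and cannot serve as phrases; for example, a-b and v-neck-half-sleeve-dress are not phrases.
Based on the above, conditions can be set to limit which hyphen-connected strings can serve as phrases. These are called string conditions and include, but are not limited to, the following:
The string is connected by hyphens: only a hyphen-connected string can become a phrase; such a string may be called a token;
The number of occurrences of the string exceeds a preset threshold: for example, more than 500 occurrences, where the number of occurrences is the counted number of times the string appears in the data warehouse;
The string is not an English word: this excludes single words, which are not phrases;
The last word of the string does not end in s, es, ex, ed, d, ing, ings, ry, ies, ves, y, or a: this mainly prevents phrases from containing plural nouns, past-tense verbs, present participles, and the like;
The string contains no conjunctions: this mainly prevents phrases from containing conjunctions (such as and, but, or, for, so, nor);
The string contains no stop words: this mainly prevents phrases from containing stop words (such as of, a);
The string contains a specified number of words: a string must contain the specified number of words to become a phrase;
The string contains no digits (percentages excepted): a string containing digits cannot become a phrase;
The length of each word in the string is less than a specified length (for example, fewer than 20 letters): otherwise the string cannot become a phrase;
The length of the string is greater than the number of words it contains: otherwise the string cannot become a phrase;
The string does not match the specified regular expression rules: only strings that do not match can become phrases. For example, the regular expression rules include, but are not limited to, "as-\w+", denoting strings beginning with "as-", and "so-\w+", denoting strings beginning with "so-".
These string conditions determine which strings are explicit phrases and which are not. For example:
Strings that are not phrases:
sleeve-less: the last word ends in s;
dress-es: the last word ends in s or es;
sleeve-s: the last word ends in s;
full-sleevevneckdresssexyclubwear: a word in the string exceeds the specified length;
a-b: the length of the string is not greater than the number of words it contains;
half-3sleeve: the string contains the digit 3;
v-neck-half-sleeve-dress: the string contains too many words;
fashion-ladies-blue-dress-party: the string contains too many words;
as-picture: the string matches a specified regular expression rule;
so-good: the string matches a specified regular expression rule.
Strings that are phrases: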
v-neck
deep-v-neck
green-flower
floor-length
100%-silk
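The string conditions can be combined into a single check, sketched below in Python. Several details are assumptions made for illustration: the word-count limit (`max_words=3`), the stop word and conjunction excerpts, measuring string length in characters excluding hyphens (which is what makes a-b fail the length condition), and treating a component such as 100% as an allowed percentage. The function name and parameters are hypothetical.

```python
import re

SUFFIXES = ("s", "es", "ex", "ed", "d", "ing", "ings", "ry", "ies", "ves", "y", "a")
CONJUNCTIONS = {"and", "but", "or", "for", "so", "nor"}
STOP_WORDS = {"of", "a"}                       # assumed excerpt
BANNED = [re.compile(r"as-\w+"), re.compile(r"so-\w+")]
PERCENT = re.compile(r"^\d+%$")


def is_explicit_phrase(token, count, english_words,
                       min_count=500, max_words=3, max_word_len=20):
    """Check a hyphen-connected token against the string conditions."""
    words = token.split("-")
    if len(words) < 2 or "" in words:
        return False                           # not hyphen-connected
    if count <= min_count:
        return False                           # too rare in the data warehouse
    if token in english_words:
        return False                           # an ordinary English word
    if words[-1].endswith(SUFFIXES):
        return False                           # plural, past tense, etc.
    if any(w in CONJUNCTIONS or w in STOP_WORDS for w in words):
        return False
    if len(words) > max_words:
        return False                           # too many words (assumed limit)
    if any(ch.isdigit() for w in words if not PERCENT.match(w) for ch in w):
        return False                           # digits outside a percentage
    if any(len(w) >= max_word_len for w in words):
        return False
    if sum(len(w) for w in words) <= len(words):
        return False                           # e.g. a-b
    if any(p.match(token) for p in BANNED):
        return False                           # e.g. as-picture
    return True
```

Called with a sufficiently large count, the check accepts v-neck, deep-v-neck, green-flower, floor-length, and 100%-silk, and rejects the non-phrase examples listed above.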
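The extracted corpus may include, but is not limited to, the title, attribute information, and keywords of the network resource. When strings satisfying the specified string conditions are extracted as explicit phrases, the title, attribute information, keywords, and so on may be merged into one information set, from which such strings are extracted. Alternatively, such strings may be extracted separately from the title, separately from the attribute information, separately from the keywords, and so on.
A network resource generally has many attributes, but not every attribute is useful for phrase extraction. Based on this, filtering rules can be configured in advance for the specific application scenario to select, from all attributes, those useful for phrase extraction, which are called key attributes. Phrase extraction is then performed with the key attributes as the corpus.
Taking the e-commerce field as an example, the network resources are products. The user configures filtering rules in advance and selects key attributes through them. Different resource categories have different filtering rules and therefore different key attributes. Assuming the category with id 3 is Apparel, the key attributes selected by the preset filtering rules include, but are not limited to, those shown in Table 1.
Table 1
Category name Category ID Key attribute name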
Apparel 3 Length
Apparel 3 Decoration
Apparel 3 Sleeve Style
Apparel 3 Neckline
Apparel 3 Gender
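The scheme of extracting phrases found in the domain dictionary as explicit phrases is described in detail below.
Specifically, the domain dictionary stores the phrases of the domain, so it can be determined directly whether the extracted corpus contains a phrase found in the domain dictionary; if so, that phrase is an explicit phrase. This approach is relatively simple to implement and efficient, and is especially suited to finding fairly obvious phrases.
The scheme of extracting attribute values in the attribute information as explicit phrases is described in detail below.
Specifically, attribute information generally includes an attribute name and an attribute value, usually in the form attribute name: attribute value. In this form, attribute values are mostly phrases with clear semantics, so attribute information can be located directly in the extracted corpus and its attribute values extracted as explicit phrases.
In the above embodiments or implementations, explicit phrases can be extracted in the several ways described. It is worth noting that these ways can be used alone or in any combination.
Further, the pattern phrase extraction process includes a step of loading preset pattern combination rules and a step of extracting implicit phrases from the extracted corpus according to those rules. Accordingly, performing the pattern phrase extraction process on the extracted corpus includes:
loading the preset pattern combination rules;
extracting, from the extracted corpus, information fragments that conform to the pattern combination rules as implicit phrases.
In an optional implementation, the pattern combination rules include, but are not limited to, at least one of: a part-of-speech combination rule, a regular expression rule, and an attribute expression rule.
The part-of-speech combination rule indicates that a word combination satisfying a specified part-of-speech combination condition can serve as an implicit phrase.
The regular expression rule indicates that a word combination matching a specified regular expression can serve as an implicit phrase.
The attribute expression rule indicates that an implicit phrase is generated from attribute information according to a preset generation rule.
Based on these pattern combination rules, extracting information fragments that conform to the rules as implicit phrases may include at least one of the following operations:
extracting, from the extracted corpus, word combinations that satisfy the specified part-of-speech combination condition as implicit phrases;
extracting, from the extracted corpus, word combinations that match the specified regular expression as implicit phrases;
when the extracted corpus includes the attribute information of the network resource, generating implicit phrases from the attribute information according to the preset generation rule.
The scheme of extracting word combinations that satisfy the specified part-of-speech combination condition as implicit phrases is described in detail below.
Specifically, research and analysis show that certain part-of-speech combination patterns usually form phrases; for example, word combinations of the form adjective + noun ("^JJ\\s+NNS{0,1}$") or adjective + adjective + noun ("^JJ\\s+JJ\\s+NNS{0,1}$") are generally phrases. Based on this, the part-of-speech combination conditions may include the adjective + noun pattern and the adjective + adjective + noun pattern; other patterns are also possible. For example, green flowers, natural-color, and hooded-collar are word combinations of the adjective + noun pattern and are phrases; small green flowers is a word combination of the adjective + adjective + noun pattern and is also a phrase.
In a specific implementation, the window length can be set according to the number of words a phrase contains, and the extracted corpus is sampled in sequence with that window length. It is then determined whether the part-of-speech sequence of each sampled word combination satisfies the part-of-speech combination condition; if so, the word combination is an implicit phrase; if not, it is discarded and the next sample is taken.
If phrases are set to contain 2 or 3 words, two window lengths, 2 and 3, can be set to sample word combinations of length 2 and 3.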
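The window sampling can be sketched in Python as follows. The sketch assumes the corpus has already been part-of-speech tagged and is supplied as (word, tag) pairs using the JJ/NN/NNS tags that appear in the patterns above; how the tags are produced is outside the sketch, and the function name is hypothetical.

```python
import re

# The two part-of-speech patterns quoted above.
POS_PATTERNS = [
    re.compile(r"^JJ\s+NNS{0,1}$"),           # adjective + noun
    re.compile(r"^JJ\s+JJ\s+NNS{0,1}$"),      # adjective + adjective + noun
]


def pos_phrases(tagged, window_sizes=(2, 3)):
    """Slide windows over (word, tag) pairs and keep matching combinations."""
    found = []
    for n in window_sizes:
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = " ".join(tag for _, tag in window)
            if any(p.match(tags) for p in POS_PATTERNS):
                found.append(" ".join(word for word, _ in window))
    return found
```

For the tagged sequence small/JJ green/JJ flowers/NNS dress/NN, the windows of length 2 and 3 yield "green flowers" and "small green flowers".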
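The scheme of extracting word combinations that match the specified regular expression as implicit phrases is described in detail below.
Specifically, some phrases are neither fixed collocations nor instances of a part-of-speech combination pattern, that is, they cannot be obtained by ordinary grammatical means, but they follow a certain word-formation pattern, for example all ending in style or beginning with a percentage. For such phrases, regular expressions are preset, and word combinations matching a preset regular expression are also phrases.
Some regular expressions that denote phrases:
"^[a-z]*?\\s+style$" denotes xxx style, that is, a word combination of the form word + style, which may be a phrase and needs to be acquired, for example "sexy style" and "bohemia style";
"^[0-9]+%\\s+[a-z]+$" denotes xx% xxx, that is, a word combination of the form percentage + word, which may be a phrase and needs to be acquired, for example "100% cotton";
"^%[0-9]+\\s+[a-z]+$" denotes %xx xxx, that is, a word combination of the form percentage + word, which may be a phrase and needs to be acquired, for example "%100 cotton".
In a specific implementation, the extracted corpus can be searched for the identification part of a regular expression (for example style or %). Once the identification part is located, the words before or after it are checked against the format of the regular expression; if they meet its requirements, the word combination formed by the identification part and the words before or after it is acquired as an implicit phrase.
The scheme of generating implicit phrases from attribute information according to the preset generation rule is described in detail below.
Specifically, the attribute information of a network resource includes an attribute name and an attribute value. When the extracted corpus includes the attribute information of the network resource, implicit phrases can be generated from the attribute information according to the preset generation rule.
Further, the generation rule indicates that the attribute name is converted into a display attribute name, and that the attribute value and the display attribute name are combined to generate an implicit phrase.
Based on the above, generating implicit phrases from attribute information according to the preset generation rule includes:
generating a display attribute name from the attribute name in the attribute information, and combining the attribute value in the attribute information with the display attribute name to generate an implicit phrase.
A conversion rule from attribute names to display attribute names can be preset, and the display attribute name is then generated based on that conversion rule. The conversion rule can be set adaptively for the application scenario. Taking the clothing category in the e-commerce field as an example, an example of conversion rules from attribute names to display attribute names is as follows: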
dresses length/dress
sleeve length/sleeve
sleeve style/sleeve
sleeve type/sleeve
sleeve/sleeve
hooded/hooded
material/NULL
neckline/neckline
waistline/waistline
decoration/decoration
style/style
silhouette/silhouette
fabric type/fabric
season/NULL
for season/NULL
for the season/NULL
pattern type/pattern
color/NULL
color style/NULL
technics/technics
item type/NULL
item name/NULL
product category/NULL
outerwear type/outerwear
eyewear type/NULL
scarves type/NULL
clothing length/clothing
collar/collar
closure type/closure
thickness/thickness
back design/back
built-in bra/built-in bra
waistline/waistline
wedding dress fabric/NULL
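In the above examples, each rule consists of three parts: the attribute name, a slash, and the display attribute name. The slash separates the two: the attribute name is on its left and the display attribute name is on its right.
Based on these examples, one way to generate an implicit phrase is: attribute value + display attribute name. In a specific implementation, the attribute information is obtained, the attribute name in it is converted into the display attribute name according to the conversion rule, and the attribute value and the display attribute name are then combined in this way to form an implicit phrase.
For example, suppose the attribute information is sleeve length:half, where the attribute name is "sleeve length" and the attribute value is "half". The attribute name "sleeve length" is converted into the display attribute name "sleeve", and the attribute value "half" and the display attribute name "sleeve" are combined to generate the implicit phrase "half-sleeve".
As another example, suppose the attribute information is sleeve style:bat wing, where the attribute name is "sleeve style" and the attribute value is "bat wing". The attribute name "sleeve style" is converted into the display attribute name "sleeve", and the attribute value "bat wing" and the display attribute name "sleeve" are combined to generate the implicit phrase "bat-wing-sleeve".
It is worth noting that the display attribute name may be "NULL", meaning that when the implicit phrase is generated the display attribute name is empty and the attribute name is not used.
In addition, some attribute values are of Boolean type. Take the attribute information "build-in-bra:yes" (typically found on goods in the wedding-dress category, indicating whether the dress has a built-in bra): if the attribute value is "yes", "y", or another value meaning "yes", the attribute value can be omitted when the implicit phrase is formed; otherwise it is not omitted. For example, from the attribute information "build-in-bra:yes" the implicit phrase formed is "build-in-bra", and from the attribute information "build-in-bra:not" the implicit phrase formed is "not-build-in-bra".
Implicit phrases can thus be extracted through the operations above. It is worth noting that these ways of extracting implicit phrases can be used alone or in any combination.
After the explicit phrases and implicit phrases are extracted from the extracted corpus, the frequency of occurrence of each of these phrases can be counted, that is, how often each phrase occurs in the data warehouse, which serves as the corpus; these frequencies guide phrase tagging. One specific counting method is to determine whether each phrase appears in the data warehouse; if it does, the frequency of that phrase is increased by 1, and otherwise it is left unchanged. This yields the frequency of occurrence of each phrase.
The principle of phrase tagging is to tag the word combinations that are most likely to be phrases. The more frequently an identified phrase occurs, the more likely it is to be a genuine phrase. The phrases to be tagged in the extracted corpus can therefore be determined from the frequency of occurrence of each phrase, for example by taking the most frequent phrase as the phrase to be tagged; the phrases to be tagged are then tagged in the extracted corpus to obtain the phrase tagging result.
For example, suppose a sentence to be phrase-tagged is as follows: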
new arrive mini dress club wear dress deep v neck off shoulder anti winkle dress live leopard print dress free shipping
After phrase extraction and frequency counting, the following phrases and their frequencies are obtained:
Figure PCTCN2017075831-appb-000001
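Here mini-dress, dress-club, and club-wear-dress could all serve as phrases, but they clearly conflict, so the maximum-frequency principle is applied, giving the following phrase tagging result: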
new arrive mini-dress club-wear-dress deep-v-neck off-shoulder anti-winkle-dress live leopard-print-dress free shipping
In the above tagging result, the phrases joined by hyphens are the tagged phrases.
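One way to resolve such conflicts by the maximum-frequency principle is a greedy pass that accepts phrases in descending order of frequency and skips any phrase that overlaps words already claimed. The sketch below is an assumption about how the principle could be applied, not the procedure of the application; the function name `tag_phrases` and the frequencies used in the usage note are hypothetical, since the frequency table itself is only available as a figure.

```python
def tag_phrases(sentence: str, phrase_freq: dict) -> str:
    """Join the words of non-conflicting phrases with hyphens, most frequent first."""
    words = sentence.split()
    taken = [False] * len(words)   # words already claimed by an accepted phrase
    span_end = {}                  # start index -> end index of accepted phrases

    # Try the most frequent phrases first so that they win conflicts.
    for phrase, _freq in sorted(phrase_freq.items(), key=lambda kv: -kv[1]):
        parts = phrase.split("-")
        n = len(parts)
        for start in range(len(words) - n + 1):
            window = slice(start, start + n)
            if words[window] == parts and not any(taken[window]):
                taken[window] = [True] * n
                span_end[start] = start + n

    # Rebuild the sentence, emitting each accepted span as one hyphenated token.
    out, i = [], 0
    while i < len(words):
        end = span_end.get(i, i + 1)
        out.append("-".join(words[i:end]))
        i = end
    return " ".join(out)
```

With hypothetical frequencies `{"mini-dress": 120, "club-wear-dress": 40, "dress-club": 3}`, the fragment `new arrive mini dress club wear dress` is tagged as `new arrive mini-dress club-wear-dress`: dress-club loses because both of its words are already claimed by more frequent phrases.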
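In addition, taking the title of the above product as an example, phrase tagging of the title after text preprocessing yields the following result: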
sexy-style girls’black-dress package-hip-dress one-shoulder long-sleeve green-flower v-neck o-neck cocktail-dress wholesale-retail free-shipping 100%-cotton
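It is worth noting that phrase tagging plays a major role: it achieves phrase-level segmentation of English text, which yields text processing units of the smallest semantic granularity. Feature words can then be extracted on the basis of the phrase tagging result.
One specific implementation of extracting feature words from the phrase tagging result includes:
extracting candidate words from the phrase tagging result;
calculating the weights of the candidate words according to their occurrence frequencies;
selecting, according to the weights of the candidate words, the candidate words whose weights fall within a preset weight interval as the feature words of the network resource.
Optionally, extracting candidate words from the phrase tagging result specifically includes: removing the useless words from the phrase tagging result to obtain the candidate words, that is, taking the remaining words or phrases as the candidate words.
The useless words include, but are not limited to: stop words, quantity words, direction words, and words matching regular expression rules.
Stop words: words that are of no use in natural language processing. English has standard stop word lists; for example, the standard English stop words include: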
[u'i',u'me',u'my',u'myself',u'we',u'our',u'ours',u'ourselves',u'you',u'your',u'yours',u'yourself',u'yourselves',u'he',u'him',u'his',u'himself',u'she',u'her',u'hers',u'herself',u'it',u'its',u'itself',u'they',u'them',u'their',u'theirs',u'themselves',u'what',u'which',u'who',u'whom',u'this',u'that',u'these',u'those',u'am',u'is',u'are',u'was',u'were',u'be',u'been',u'being',u'have',u'has',u'had',u'having',u'do',u'does',u'did',u'doing',u'a',u'an',u'the',u'and',u'but',u'if',u'or',u'because',u'as',u'until',u'while',u'of',u'at',u'by',u'for',u'with',u'about',u'against',u'between',u'into',u'through',u'during',u'before',u'after',u'above',u'below',u'to',u'from',u'up',u'down',u'in',u'out',u'on',u'off',u'over',u'under',u'again',u'further',u'then',u'once',u'here',u'there',u'when',u'where',u'why',u'how',u'all',u'any',u'both',u'each',u'few',u'more',u'most',u'other',u'some',u'such',u'no',u'nor',u'not',u'only',u'own',u'same',u'so',u'than',u'too',u'very',u's',u't',u'can',u'will',u'just',u'don',u'should',u'now']
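In this embodiment, the above standard stop words are processed further. Some words that are useful within phrases, such as with, between, under and over, are removed from the list. In addition, depending on the field to which the network resources belong, some new useless words or phrases are added, such as wholesale, retail, shipping, free-shipping, fashion, price, offer, none, quantity and shipment. The result is the stop word list.
It is worth noting that the useless words or phrases newly added to the stop word list are those that, after analysis of the data, are determined to be useless for describing the features of network resources and likely to cause interference. These words or phrases may be found by manual analysis or by automated analysis.
Specifically, each word or phrase in the phrase tagging result may be matched against the stop word list; if an identical word or phrase is found in the stop word list, the word or phrase is determined to be a stop word and is deleted from the phrase tagging result.
Quantity words: taking English as an example, common quantity words include one, two, three, width, thin, thick and so on. Specifically, a quantity word list may be formed. In practice, each word or phrase in the phrase tagging result is matched against the quantity word list; if an identical word or phrase is found in the list, the word or phrase is determined to be a quantity word and is deleted from the phrase tagging result.
Direction words: taking English as an example, common direction words include front, back, left, right, up, down and so on. Specifically, a direction word list may be formed. In practice, each word or phrase in the phrase tagging result is matched against the direction word list; if an identical word or phrase is found in the list, the word or phrase is determined to be a direction word and is deleted from the phrase tagging result.
Words matching regular expression rules: for example, rules such as "^[sxlm-]+$", "^\\d+%$" and "^%\\d+$". The first removes certain size words, such as xl and xxxl; the second and third remove standalone percentages. A percentage that is meaningful will already have been tagged as part of a phrase.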
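The four filters can be combined into a single pass over the tagged tokens, as in the sketch below. The word sets are short illustrative excerpts of the lists described above, and the function name `extract_candidates` is hypothetical; the three regular expressions are the ones quoted in the text, written as Python raw strings so that `\\d` becomes `\d`.

```python
import re

# Illustrative excerpts only; real lists would be much longer.
STOP_WORDS = {"the", "and", "for", "wholesale", "retail", "shipping",
              "free-shipping", "fashion", "price"}
QUANTITY_WORDS = {"one", "two", "three", "width", "thin", "thick"}
DIRECTION_WORDS = {"front", "back", "left", "right", "up", "down"}

# Size words such as xl/xxxl, and standalone percentages such as 100% or %100.
USELESS_PATTERNS = [re.compile(p) for p in (r"^[sxlm-]+$", r"^\d+%$", r"^%\d+$")]


def extract_candidates(tagged_text: str) -> list:
    """Drop useless tokens from a phrase tagging result; keep the rest in order."""
    candidates = []
    for token in tagged_text.split():
        if token in STOP_WORDS or token in QUANTITY_WORDS or token in DIRECTION_WORDS:
            continue
        if any(pattern.match(token) for pattern in USELESS_PATTERNS):
            continue
        candidates.append(token)
    return candidates
```

Because matching is done on whole tokens, a tagged phrase such as `100%-cotton` survives while a bare `100%` is removed, which is the behaviour the third rule category relies on.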
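Taking the title of the above product as an example, the candidate words that can be extracted from the phrase tagging result are as follows: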
sexy-style girls’black-dress package-hip-dress one-shoulder long-sleeve green-flower v-neck o-neck cocktail-dress 100%-cotton
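Further, taking the attribute information of the product as an example, the candidate words that can be extracted from the phrase tagging result are as follows: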
floor-length beading-decoration woman summer print-pattern off-shoulder o-neck casual-style
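Furthermore, taking both the title and the attribute information of the product as an example, the candidate words that can be extracted from the phrase tagging result are as follows: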
sexy-style girls’black-dress package-hip-dress one-shoulder long-sleeve green-flower v-neck o-neck cocktail-dress 100%-cotton floor-length beading-decoration woman summer print-pattern off-shoulder o-neck casual-style
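After the candidate words are obtained, the weight of each candidate word may be calculated. In one implementation, the TF-IDF value may be used to represent the weight of a candidate word.
Specifically, the candidate words of each network resource form one document, denoted D; the documents formed from the candidate words of a large number of network resources make up a document set;
for each network resource, calculate the term frequency of each of its candidate words (that is, the number of times the candidate word is repeated in the candidate word list of the current network resource); suppose the frequency of a candidate word A of the network resource in document D is $TF_{D\text{-}A}$;
calculate the inverse document frequency of the candidate words over all documents; suppose the inverse document frequency of candidate word A is calculated as: $IDF_A = \log\left(N_{DOC} / (N_{DOCIN} + 1)\right)$;
where $N_{DOC}$ is the total number of documents and $N_{DOCIN}$ is the number of documents in which candidate word A occurs. It is worth noting that the inverse document frequency of a candidate word is unique, that is, each candidate word has a single value over the whole document set;
for each network resource, calculate the TF-IDF value of each of its candidate words; for example, the TF-IDF value of candidate word A is calculated as: $\text{TF-IDF}(A) = TF_{D\text{-}A} \times IDF_A$.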
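The calculation above can be sketched in a few lines. The function name `tf_idf` and the representation of each document as a list of candidate words are assumptions made for illustration; the formulas are the ones given in the text, with the natural logarithm chosen as the base because the text does not specify one.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {candidate word: TF-IDF value} dict per document.

    documents: list of candidate word lists, one list per network resource.
    """
    n_doc = len(documents)

    # N_DOCIN: number of documents in which each candidate word occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    # IDF_A = log(N_DOC / (N_DOCIN + 1)); one value per word for the whole set.
    idf = {word: math.log(n_doc / (n_in + 1)) for word, n_in in doc_freq.items()}

    # TF-IDF(A) = TF_{D-A} * IDF_A, where TF is the repeat count within document D.
    return [{word: tf * idf[word] for word, tf in Counter(doc).items()}
            for doc in documents]
```

Note that with the "+1" in the denominator, a word that occurs in every document gets a slightly negative IDF, so a lower bound on the weight also filters such words out.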
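A weight, namely the TF-IDF value, is thus calculated for each candidate word of each network resource. By comparison, it can be seen that words such as fashion and sexy have very low TF-IDF values. Under the formula above, this means that they occur in a very large share of the documents, so their inverse document frequency is close to zero and they are of little use for expressing the features of a particular network resource. There are also words whose TF-IDF values are abnormally high; such words occur in very few documents and cover very few products, which suggests that they may be erroneous, and they are likewise of no use for expressing the features of a network resource.
Based on this analysis, a weight interval may be defined in advance, and the candidate words whose weights fall within the preset weight interval are selected as feature words. For example, two variables, $\text{TF-IDF}_{HIGH}$ and $\text{TF-IDF}_{LOW}$, may be defined to filter out extremely high and extremely low TF-IDF values respectively. After filtering, what remains are the candidate words whose TF-IDF values fall within the preset TF-IDF interval, which are the feature words finally selected.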
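The interval filter can be sketched as follows. The function name `select_feature_words`, the inclusive bounds, and the threshold values in the usage note are assumptions; the text only states that a low and a high threshold are defined in advance.

```python
def select_feature_words(weights: dict, low: float, high: float) -> list:
    """Keep candidate words whose weight lies in [low, high], heaviest first.

    weights: {candidate word: TF-IDF value} for one network resource.
    low, high: assumed inclusive bounds playing the role of the two thresholds.
    """
    kept = [(word, value) for word, value in weights.items() if low <= value <= high]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in kept]
```

For example, with hypothetical weights `{"fashion": 0.01, "v-neck": 0.8, "vnekc": 9.5, "long-sleeve": 1.2}` and the interval `[0.1, 5.0]`, the very common word `fashion` and the suspiciously rare `vnekc` are both dropped.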
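As can be seen from the above, the present application takes the description information of network resources as the extraction corpus and extracts feature words from it by means of phrase tagging, thereby providing a method for automatically extracting feature words. No manual intervention is required, so efficiency is high; and because no manual intervention is required, massive amounts of data can be processed, which helps ensure both the quantity and the quality of the feature words. Furthermore, during extraction, applying various kinds of text preprocessing to the description information and selecting from the candidate words those whose weights fall within the weight interval help to further improve the quality of the feature words.
After the feature words are obtained, they may be applied in a personalized recommendation system to improve recommendation efficiency and accuracy. The feature words may also be used on the display pages of network resources, so that users can understand the network resources more quickly and intuitively.
For example, a specific way of using feature words in a personalized recommendation system includes: labeling a user with corresponding tags according to the network resources (for example, products) that the user has previously paid attention to (for example, by browsing, adding to the shopping cart or adding to favorites); when the user browses network resources again, the user's tags may be associated with the feature words of the network resources, and the associated network resources recommended to the user. Taking the e-commerce field as an example, feature words are more divergent than product categories, so product recommendation based on feature words can often lead users to discover products that are both novel and appealing.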
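As an illustration of associating user tags with feature words, the sketch below scores each network resource by the number of feature words it shares with the user's tags. The scoring rule, the function name `recommend` and the sample identifiers are all assumptions; the text does not specify how the association is computed.

```python
def recommend(user_tags, resource_features, top_n=3):
    """Return up to top_n resource ids whose feature words overlap the user's tags.

    user_tags: tags derived from resources the user browsed, carted or favorited.
    resource_features: {resource id: list of feature words}.
    """
    tags = set(user_tags)
    scored = []
    for resource_id, features in resource_features.items():
        overlap = len(tags & set(features))
        if overlap:
            scored.append((overlap, resource_id))
    # Most shared feature words first; ties broken by id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [resource_id for _, resource_id in scored[:top_n]]
```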
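Furthermore, applying feature words often involves selecting or ranking them. The weight of each feature word is obtained together with the feature word itself and reflects its importance to some extent, so the feature words may be ranked by weight. However, the weight calculated earlier is based mainly on the frequency of the feature word and does not take other factors into account. For this reason, as shown in FIG. 3, after the feature words are extracted, the method further includes:
104. Correct the weights of the feature words according to specified correction factors, so that the corrected weights reflect the importance of the feature words from more aspects.
105. Rank the feature words based on the corrected weights. In practice, the top-ranked feature words may preferably be selected.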
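Steps 104 and 105 can be sketched as below. How each correction factor enters the weight is not stated here, so the multiplicative combination, the factor names and the function name `rank_feature_words` are purely hypothetical.

```python
def rank_feature_words(weights: dict, factors: dict) -> list:
    """Correct each weight with per-word multipliers, then sort in descending order.

    weights: {feature word: frequency-based weight}.
    factors: {feature word: {factor name: multiplier}}; a word with no entry
             keeps its original weight (assumed behaviour).
    """
    corrected = {}
    for word, weight in weights.items():
        multiplier = 1.0
        for value in factors.get(word, {}).values():
            multiplier *= value
        corrected[word] = weight * multiplier
    return sorted(corrected.items(), key=lambda item: item[1], reverse=True)
```

For instance, a word with a modest frequency-based weight but rising search popularity could overtake a more frequent word once its multipliers are applied.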
Optionally, the specified correction factors include, but are not limited to, those listed in Table 2 below:
Table 2
Figure PCTCN2017075831-appb-000002
Figure PCTCN2017075831-appb-000003
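After obtaining the feature words, the present application may also take various factors into account, such as the search popularity of a word, the trend of its search popularity, its novelty and its expressiveness, to correct the weights of the feature words, so that the corrected weights reflect the importance of the feature words from more aspects and facilitate their application.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
FIG. 4 is a schematic structural diagram of a feature vocabulary extraction apparatus provided by yet another embodiment of the present application. As shown in FIG. 4, the apparatus includes an acquisition module 41, a tagging module 42 and an extraction module 43.
The acquisition module 41 is configured to acquire the description information of a network resource as the extraction corpus.
The tagging module 42 is configured to perform phrase tagging on the extraction corpus to obtain a phrase tagging result.
The extraction module 43 is configured to extract, from the phrase tagging result, feature words that reflect the features of the network resource.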
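The three modules form a simple pipeline in which the output of each module is the input of the next. The sketch below shows that composition only; the class name `FeatureWordExtractor`, the callable signatures and the string-based interfaces are assumptions, and the real modules may be implemented in hardware, software or a mix of both.

```python
from typing import Callable, List


class FeatureWordExtractor:
    """Chains the acquisition (41), tagging (42) and extraction (43) modules."""

    def __init__(self,
                 acquire: Callable[[str], str],
                 tag: Callable[[str], str],
                 extract: Callable[[str], List[str]]) -> None:
        self.acquire = acquire    # resource id -> extraction corpus
        self.tag = tag            # extraction corpus -> phrase tagging result
        self.extract = extract    # phrase tagging result -> feature words

    def run(self, resource_id: str) -> List[str]:
        corpus = self.acquire(resource_id)
        tagged = self.tag(corpus)
        return self.extract(tagged)
```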
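In an optional implementation, the acquisition module 41 is specifically configured to:
acquire the original description information of the network resource from a data warehouse;
perform text preprocessing on the original description information to obtain standardized description information as the extraction corpus.
Optionally, the original description information of the network resource includes at least one of: the title, attribute information, keywords, detail information and review information of the network resource.
Further, when performing text preprocessing on the original description information to obtain standardized description information as the extraction corpus, the acquisition module 41 is specifically configured to:
perform on the original description information at least one of connector symbol retention processing, case conversion processing, spelling consistency check processing, word segmentation processing, spelling correction processing and noun lemmatization processing.
Furthermore, when performing connector symbol retention processing on the original description information, the acquisition module 41 is specifically configured to:
retain the hyphens, single quotation marks or percent signs in the original description information that conform to a specified format, and delete the hyphens, single quotation marks and percent signs that do not conform to the specified format, as well as other connector symbols, from the original description information.
Furthermore, when performing spelling consistency check processing on the original description information, the acquisition module 41 is specifically configured to:
for each word or phrase in the original description information, if the word or phrase appears with multiple spellings in the original description information, count the number of times each spelling occurs in the data warehouse;
select, from the multiple spellings, the spelling that occurs most often and more often than a preset count threshold as the target spelling, and replace the other spellings of the word or phrase in the original description information with the target spelling.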
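The selection of a target spelling can be sketched as follows, assuming the occurrence counts of the competing spellings have already been obtained from the data warehouse. The function name `unify_spelling`, the token-list representation and the sample counts are hypothetical.

```python
def unify_spelling(tokens, variant_counts: dict, min_count: int) -> list:
    """Replace every variant spelling in tokens with the dominant spelling.

    variant_counts: non-empty {spelling: occurrences in the data warehouse}
                    for the competing spellings of one word or phrase.
    min_count: preset threshold the most frequent spelling must exceed.
    """
    target, count = max(variant_counts.items(), key=lambda item: item[1])
    if count <= min_count:
        # No spelling is frequent enough to be trusted; leave the text alone.
        return list(tokens)
    return [target if token in variant_counts else token for token in tokens]
```

For example, with hypothetical counts `{"t-shirt": 900, "tshirt": 120, "tee-shirt": 15}` and a threshold of 100, every variant is rewritten to `t-shirt`; with a threshold of 1000 nothing is changed.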
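Furthermore, when performing noun lemmatization processing on the original description information, the acquisition module 41 is specifically configured to:
lemmatize the nouns in the original description information according to at least one of a dictionary and preset singular/plural conversion rules.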
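A minimal sketch of combining a dictionary with singular/plural rules is shown below. The dictionary entries and suffix rules are illustrative assumptions rather than the rules of the application, and the function name `singularize` is hypothetical; the dictionary is consulted first, and the suffix rules act as a fallback.

```python
# Hypothetical dictionary of irregular plurals, consulted before any rule.
IRREGULAR_PLURALS = {"women": "woman", "men": "man", "children": "child", "feet": "foot"}


def singularize(word: str) -> str:
    """Reduce a plural English noun to its singular form (rough heuristic)."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"              # accessories -> accessory
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                    # dresses -> dress, boxes -> box
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]                    # sleeves -> sleeve
    return word                             # already singular, e.g. dress
```

Suffix rules alone mishandle some nouns (scarves would become scarve), which is one reason a dictionary is useful alongside them.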
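Further, the tagging module 42 is specifically configured to:
perform an explicit phrase extraction process and a pattern phrase extraction process on the extraction corpus, so as to extract explicit phrases and implicit phrases from the extraction corpus;
determine the phrases to be tagged in the extraction corpus according to the occurrence frequencies of the explicit phrases and implicit phrases;
tag the phrases to be tagged in the extraction corpus to obtain the phrase tagging result.
Further, the extraction module 43 is specifically configured to:
extract candidate words from the phrase tagging result;
calculate the weights of the candidate words;
select, according to the weights of the candidate words, the candidate words whose weights fall within a preset weight interval as the feature words.
Furthermore, when extracting candidate words from the phrase tagging result, the extraction module 43 is specifically configured to:
remove the useless words from the phrase tagging result to obtain the candidate words.
Further, as shown in FIG. 5, the apparatus also includes a correction module 44 and a ranking module 45.
The correction module 44 is configured to correct the weights of the feature words according to specified correction factors.
The ranking module 45 is configured to rank the feature words according to the corrected weights.
The feature vocabulary extraction apparatus provided by this embodiment uses the description information of network resources as the extraction corpus, performs phrase tagging on the extraction corpus, and then extracts from it feature words that reflect the features of the network resources. Compared with the prior-art scheme of guiding network resource providers to fill in feature words manually, this eliminates the influence of the providers' subjective factors; it can not only extract a sufficient number of feature words but also ensure their quality.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.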

Claims (22)

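  1. A feature vocabulary extraction method, characterized by comprising:
    acquiring description information of a network resource as an extraction corpus;
    performing phrase tagging on the extraction corpus to obtain a phrase tagging result;
    extracting, from the phrase tagging result, feature words that reflect the features of the network resource.
  2. The method according to claim 1, characterized in that acquiring description information of a network resource as an extraction corpus comprises:
    acquiring original description information of the network resource from a data warehouse;
    performing text preprocessing on the original description information to obtain standardized description information as the extraction corpus.
  3. The method according to claim 2, characterized in that the original description information of the network resource comprises at least one of: the title, attribute information, keywords, detail information and review information of the network resource.
  4. The method according to claim 2, characterized in that performing text preprocessing on the original description information to obtain standardized description information as the extraction corpus comprises:
    performing on the original description information at least one of connector symbol retention processing, case conversion processing, spelling consistency check processing, word segmentation processing, spelling correction processing and noun lemmatization processing.
  5. The method according to claim 4, characterized in that performing connector symbol retention processing on the original description information comprises:
    retaining the hyphens, single quotation marks or percent signs in the original description information that conform to a specified format, and deleting the hyphens, single quotation marks and percent signs that do not conform to the specified format, as well as other connector symbols, from the original description information.
  6. The method according to claim 4, characterized in that performing spelling consistency check processing on the original description information comprises:
    for each word or phrase in the original description information, if the word or phrase appears with multiple spellings in the original description information, counting the number of times each spelling occurs in the data warehouse;
    selecting, from the multiple spellings, the spelling that occurs most often and more often than a preset count threshold as a target spelling, and replacing the other spellings of the word or phrase in the original description information with the target spelling.
  7. The method according to claim 4, characterized in that performing noun lemmatization processing on the original description information comprises:
    lemmatizing the nouns in the original description information according to at least one of a dictionary and preset singular/plural conversion rules.
  8. The method according to claim 1, characterized in that performing phrase tagging on the extraction corpus to obtain a phrase tagging result comprises:
    performing an explicit phrase extraction process and a pattern phrase extraction process on the extraction corpus, so as to extract explicit phrases and implicit phrases from the extraction corpus;
    determining the phrases to be tagged in the extraction corpus according to the occurrence frequencies of the explicit phrases and implicit phrases;
    tagging the phrases to be tagged in the extraction corpus to obtain the phrase tagging result.
  9. The method according to claim 1, characterized in that extracting, from the phrase tagging result, feature words that reflect the features of the network resource comprises:
    extracting candidate words from the phrase tagging result;
    calculating the weights of the candidate words according to the occurrence frequencies of the candidate words;
    selecting, according to the weights of the candidate words, the candidate words whose weights fall within a preset weight interval as the feature words.
  10. The method according to claim 9, characterized in that extracting candidate words from the phrase tagging result comprises:
    removing the useless words from the phrase tagging result to obtain the candidate words.
  11. The method according to any one of claims 1-10, characterized in that, after extracting, from the phrase tagging result, feature words that reflect the features of the network resource, the method further comprises:
    correcting the weights of the feature words according to specified correction factors;
    ranking the feature words according to the corrected weights.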
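  12. A feature vocabulary extraction apparatus, characterized by comprising:
    an acquisition module, configured to acquire description information of a network resource as an extraction corpus;
    a tagging module, configured to perform phrase tagging on the extraction corpus to obtain a phrase tagging result;
    an extraction module, configured to extract, from the phrase tagging result, feature words that reflect the features of the network resource.
  13. The apparatus according to claim 12, characterized in that the acquisition module is specifically configured to:
    acquire original description information of the network resource from a data warehouse;
    perform text preprocessing on the original description information to obtain standardized description information as the extraction corpus.
  14. The apparatus according to claim 13, characterized in that the original description information of the network resource comprises at least one of: the title, attribute information, keywords, detail information and review information of the network resource.
  15. The apparatus according to claim 13, characterized in that the acquisition module is specifically configured to:
    perform on the original description information at least one of connector symbol retention processing, case conversion processing, spelling consistency check processing, word segmentation processing, spelling correction processing and noun lemmatization processing.
  16. The apparatus according to claim 15, characterized in that the acquisition module is specifically configured to:
    retain the hyphens, single quotation marks or percent signs in the original description information that conform to a specified format, and delete the hyphens, single quotation marks and percent signs that do not conform to the specified format, as well as other connector symbols, from the original description information.
  17. The apparatus according to claim 15, characterized in that the acquisition module is specifically configured to:
    for each word or phrase in the original description information, if the word or phrase appears with multiple spellings in the original description information, count the number of times each spelling occurs in the data warehouse;
    select, from the multiple spellings, the spelling that occurs most often and more often than a preset count threshold as a target spelling, and replace the other spellings of the word or phrase in the original description information with the target spelling.
  18. The apparatus according to claim 15, characterized in that the acquisition module is specifically configured to:
    lemmatize the nouns in the original description information according to at least one of a dictionary and preset singular/plural conversion rules.
  19. The apparatus according to claim 12, characterized in that the tagging module is specifically configured to:
    perform an explicit phrase extraction process and a pattern phrase extraction process on the extraction corpus, so as to extract explicit phrases and implicit phrases from the extraction corpus;
    determine the phrases to be tagged in the extraction corpus according to the occurrence frequencies of the explicit phrases and implicit phrases;
    tag the phrases to be tagged in the extraction corpus to obtain the phrase tagging result.
  20. The apparatus according to claim 12, characterized in that the extraction module is specifically configured to:
    extract candidate words from the phrase tagging result;
    calculate the weights of the candidate words;
    select, according to the weights of the candidate words, the candidate words whose weights fall within a preset weight interval as the feature words.
  21. The apparatus according to claim 20, characterized in that the extraction module is specifically configured to:
    remove the useless words from the phrase tagging result to obtain the candidate words.
  22. The apparatus according to any one of claims 12-21, characterized by further comprising:
    a correction module, configured to correct the weights of the feature words according to specified correction factors;
    a ranking module, configured to rank the feature words according to the corrected weights.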
PCT/CN2017/075831 2016-03-17 2017-03-07 特征词汇提取方法及装置 WO2017157200A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610152669.1A CN107203507B (zh) 2016-03-17 2016-03-17 特征词汇提取方法及装置
CN201610152669.1 2016-03-17

Publications (1)

Publication Number Publication Date
WO2017157200A1 true WO2017157200A1 (zh) 2017-09-21

Family

ID=59851772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/075831 WO2017157200A1 (zh) 2016-03-17 2017-03-07 特征词汇提取方法及装置

Country Status (3)

Country Link
CN (1) CN107203507B (zh)
TW (1) TW201734847A (zh)
WO (1) WO2017157200A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635295A (zh) * 2018-06-01 2019-04-16 安徽省泰岳祥升软件有限公司 一种基于语义分析的诗词检索方法及系统
CN110134931A (zh) * 2019-05-14 2019-08-16 北京字节跳动网络技术有限公司 媒介标题生成方法、装置、电子设备及可读介质
CN110442711A (zh) * 2019-07-03 2019-11-12 平安科技(深圳)有限公司 文本智能化清洗方法、装置及计算机可读存储介质
CN110457699A (zh) * 2019-08-06 2019-11-15 腾讯科技(深圳)有限公司 一种停用词挖掘方法、装置、电子设备及存储介质
CN111126060A (zh) * 2019-12-24 2020-05-08 东软集团股份有限公司 一种主题词的提取方法、装置、设备及存储介质
CN115913676A (zh) * 2022-11-04 2023-04-04 上海申石软件有限公司 云原生应用的访问控制方法、装置、电子设备及存储介质

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580332A (zh) * 2018-06-07 2019-12-17 北京京东尚科信息技术有限公司 自动写作产品信息的方法、系统、电子设备及存储介质
CN109951354B (zh) * 2019-03-12 2021-08-10 北京奇虎科技有限公司 一种终端设备识别方法、系统及存储介质
CN109902152B (zh) * 2019-03-21 2021-07-06 北京百度网讯科技有限公司 用于检索信息的方法和装置
CN112417130B (zh) * 2020-11-19 2023-06-16 贝壳技术有限公司 词语筛选方法、装置、计算机可读存储介质及电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727487A (zh) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 一种面向网络评论的观点主题识别方法和系统
CN102193936A (zh) * 2010-03-09 2011-09-21 阿里巴巴集团控股有限公司 一种数据分类的方法及装置
CN103870973A (zh) * 2012-12-13 2014-06-18 阿里巴巴集团控股有限公司 基于电子信息的关键词提取的信息推送、搜索方法及装置
US20150310099A1 (en) * 2012-11-06 2015-10-29 Palo Alto Research Center Incorporated System And Method For Generating Labels To Characterize Message Content

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1435776A (zh) * 2002-01-31 2003-08-13 百度在线网络技术(北京)有限公司 一种基于词汇的计算机索引和检索方法

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635295A (zh) * 2018-06-01 2019-04-16 安徽省泰岳祥升软件有限公司 一种基于语义分析的诗词检索方法及系统
CN109635295B (zh) * 2018-06-01 2022-12-16 安徽省泰岳祥升软件有限公司 一种基于语义分析的诗词检索方法及系统
CN110134931A (zh) * 2019-05-14 2019-08-16 北京字节跳动网络技术有限公司 媒介标题生成方法、装置、电子设备及可读介质
CN110134931B (zh) * 2019-05-14 2023-09-22 北京字节跳动网络技术有限公司 媒介标题生成方法、装置、电子设备及可读介质
CN110442711A (zh) * 2019-07-03 2019-11-12 平安科技(深圳)有限公司 文本智能化清洗方法、装置及计算机可读存储介质
CN110442711B (zh) * 2019-07-03 2023-06-30 平安科技(深圳)有限公司 文本智能化清洗方法、装置及计算机可读存储介质
CN110457699A (zh) * 2019-08-06 2019-11-15 腾讯科技(深圳)有限公司 一种停用词挖掘方法、装置、电子设备及存储介质
CN110457699B (zh) * 2019-08-06 2023-07-04 腾讯科技(深圳)有限公司 一种停用词挖掘方法、装置、电子设备及存储介质
CN111126060A (zh) * 2019-12-24 2020-05-08 东软集团股份有限公司 一种主题词的提取方法、装置、设备及存储介质
CN115913676A (zh) * 2022-11-04 2023-04-04 上海申石软件有限公司 云原生应用的访问控制方法、装置、电子设备及存储介质
CN115913676B (zh) * 2022-11-04 2023-06-02 上海申石软件有限公司 云原生应用的访问控制方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
TW201734847A (zh) 2017-10-01
CN107203507B (zh) 2019-08-13
CN107203507A (zh) 2017-09-26

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17765742

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17765742

Country of ref document: EP

Kind code of ref document: A1