CN111881674B

CN111881674B - Core commodity word mining method and device, electronic equipment and storage medium

Info

Publication number: CN111881674B
Application number: CN202010601024.8A
Authority: CN
Inventors: 黄志标; 裴一飞
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2023-07-25
Anticipated expiration: 2040-06-28
Also published as: CN111881674A

Abstract

The application discloses a core commodity word mining method and device, an electronic device, and a storage medium, relating to the fields of artificial intelligence, electronic commerce, natural language processing, and the Internet. The method may comprise: performing word segmentation on the commodity title of a commodity to be processed; obtaining a weight for each term obtained by the word segmentation; determining extremely core words from the terms; determining candidate core commodity words according to the extremely core words and the weights; and determining the core commodity word according to the candidate core commodity words and predetermined dimension information of the commodity. Applying this scheme can reduce implementation cost and improve the accuracy of the mined core commodity words.

Description

Core commodity word mining method and device, electronic equipment and storage medium

Technical Field

The present application relates to computer application technologies, and in particular, to a method and apparatus for mining core commodity words in the fields of artificial intelligence, electronic commerce, natural language processing, and the internet, an electronic device, and a storage medium.

Background

A core commodity word refers to the specific commodity or service that a seller sells to buyers. It is widely used in e-commerce and other scenarios, for example in recommending similar commodities based on core commodity words. Core commodity words of commodities therefore need to be mined in advance.

Currently, core commodity words are generally mined as follows: the commodity title is input into a pre-trained model, and the core commodity word is determined from the output of the model. However, the models used in this approach are generally complex and take a long time to train, and when industries differ greatly, a different model must be trained for each industry, so the implementation cost is high.

Disclosure of Invention

The application provides a core commodity word mining method, a device, electronic equipment and a storage medium.

A core commodity word mining method comprises the following steps:

performing word segmentation on the commodity title of the commodity to be processed;

obtaining a weight for each term obtained by the word segmentation;

determining extremely core words from the terms;

determining candidate core commodity words according to the extremely core words and the weights;

and determining the core commodity word according to the candidate core commodity words and the predetermined dimension information of the commodity.

A core commodity word mining device, comprising: a title word segmentation module, a weight acquisition module, a candidate determination module and a commodity word determination module;

the title word segmentation module is used for performing word segmentation on the commodity title of the commodity to be processed;

the weight acquisition module is used for obtaining a weight for each term obtained by the word segmentation;

the candidate determination module is used for determining extremely core words from the terms, and determining candidate core commodity words according to the extremely core words and the weights;

and the commodity word determining module is used for determining the core commodity word according to the candidate core commodity word and the preset dimension information of the commodity.

An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.

A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.

One embodiment of the above application has the following advantages or benefits: candidate core commodity words can be determined by operations such as word segmentation on the commodity title, and the core commodity word is then finally determined in combination with the predetermined dimension information of the commodity.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:

FIG. 1 is a flowchart of an embodiment of the core commodity word mining method described in the present application;

FIG. 2 is a schematic diagram of the overall implementation process of the core commodity word mining method described in the present application;

FIG. 3 is a schematic structural diagram of an embodiment of the core commodity word mining device 30 described in the present application;

FIG. 4 is a block diagram of an electronic device for implementing the method according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

FIG. 1 is a flowchart of an embodiment of the core commodity word mining method described in the present application. As shown in FIG. 1, the method includes the following implementation.

In 101, a commodity title of a commodity to be processed is segmented.

In 102, weights of terms obtained by word segmentation are obtained respectively.

In 103, extremely core words are determined from the terms.

In 104, candidate core commodity words are determined according to the extremely core words and the obtained weights.

In 105, the core commodity word is determined according to the candidate core commodity word and the predetermined dimension information of the commodity.

It can be seen that, in the method of this embodiment, candidate core commodity words can be determined by word segmentation and the like on commodity titles, and core commodity words can be finally determined by combining with preset dimension information of commodities.

The commodity information of different commodities, such as commodity titles, can be stored in the commodity library.

As described in 101, word segmentation may be performed on the commodity title of the commodity to be processed. The word segmentation method is not limited; for example, the commonly used Jieba word segmentation method can be adopted. Each word segmentation result, that is, each term, may consist of one character or several characters.

After each term is obtained, the weights of each term may also be obtained separately, as described in 102.

For any term, the number DF of commodity titles that contain the term, among the commodity titles of all commodities in the commodity library, can be determined, as can the number of occurrences TF of the term in the word segmentation result of the commodity title being processed.

After DF and TF are obtained, the following calculation may be further performed:

where N represents the number of commodities in the commodity library; the calculated result can be normalized and then used as the weight of the term.

A repeated term may be weighted only once. In addition, terms with a large TF (for example, greater than a predetermined threshold) may be filtered using a pre-constructed vocabulary, whose contents can be determined according to actual needs. For example, a term such as "sell" that has a large TF, appears in the vocabulary, and does not help subsequent processing may be filtered out.

Through the processing, the weight of each term can be accurately and rapidly determined, so that a good foundation is laid for subsequent processing.
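The weight calculation can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes a conventional TF-IDF form, TF * log(N / DF), and assumes normalization by the sum of the raw weights; both are illustrative assumptions, and the function name `term_weights` is hypothetical.

```python
import math
from collections import Counter

def term_weights(title_terms, library_titles_terms):
    """Assumed weight: TF * log(N / DF), normalized so the weights sum to 1."""
    n = len(library_titles_terms)          # N: number of commodities in the library
    tf = Counter(title_terms)              # TF: occurrences within this title
    raw = {}
    for term, count in tf.items():         # a repeated term is weighted only once
        # DF: number of library titles whose segmentation contains the term
        df = sum(1 for terms in library_titles_terms if term in terms)
        raw[term] = count * math.log(n / df) if df else 0.0
    total = sum(raw.values())
    return {t: (w / total if total else 0.0) for t, w in raw.items()}

library = [["steel", "ball"], ["steel", "pipe"], ["ball", "machine"], ["resin"]]
weights = term_weights(["steel", "ball", "steel"], library)
```

Other normalizations (for example, dividing by the maximum) would fit the description equally well; the choice only affects the scale of the thresholds used later.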

During word segmentation, non-Chinese strings such as English words, English model names and English product names may be split apart. This is particularly common for commodities in fields such as electronics and chips.

Therefore, after the weights of the terms are obtained, adjacent non-Chinese terms among the terms obtained by word segmentation can be spliced to obtain new terms, and the weights of the new terms can be obtained.

Specifically, an empty queue may be initialized and the terms traversed from front to back. For each traversed term other than the last, it is determined whether the term is a Chinese term: if so, the terms already in the queue are spliced into a new term and the queue is reinitialized as empty; if not, the traversed term is added to the queue. For the last traversed term, it is likewise determined whether the term is a Chinese term: if so, the terms already in the queue are spliced into a new term; if not, the term is added to the queue and the terms in the queue are then spliced into a new term.

In addition, for any new term, the weights of the terms spliced into the new term may be added, and the sum used as the weight of the new term.

For example, suppose that segmenting the commodity title yields four terms, term A, term B, term C and term D, and that term B and term C are spliced into a new term B'. The sum of the weights of term B and term C can then be used as the weight of term B'. After splicing, the terms corresponding to the commodity title change from term A, term B, term C and term D to term A, term B' and term D.

Splicing adjacent non-Chinese terms effectively prevents the boundaries of core commodity words from being recognized incorrectly later on.
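The queue-based splicing described above can be sketched as follows. The sketch assumes that a term counts as a Chinese term when it contains at least one CJK character, and that spliced terms are joined without a separator; the names `is_chinese` and `splice_non_chinese` and the sample terms are hypothetical.

```python
import re

_CJK = re.compile(r"[\u4e00-\u9fff]")

def is_chinese(term):
    # Assumption: any CJK character makes the term a Chinese term.
    return bool(_CJK.search(term))

def splice_non_chinese(terms, weights):
    """Splice runs of adjacent non-Chinese terms and sum their weights."""
    out_terms, out_weights, queue = [], {}, []

    def flush():
        # Splice whatever is queued into one new term, then empty the queue.
        if queue:
            new_term = "".join(queue)
            out_terms.append(new_term)
            out_weights[new_term] = sum(weights[t] for t in queue)
            queue.clear()

    for term in terms:
        if is_chinese(term):
            flush()
            out_terms.append(term)
            out_weights[term] = weights[term]
        else:
            queue.append(term)
    flush()  # covers the case where the last term is non-Chinese
    return out_terms, out_weights

terms = ["芯片", "STM32", "F103", "开发板"]
weights = {"芯片": 0.5, "STM32": 0.2, "F103": 0.1, "开发板": 0.4}
new_terms, new_weights = splice_non_chinese(terms, weights)
# new_terms == ["芯片", "STM32F103", "开发板"]
```

A single final `flush()` handles both branches of the last-term rule, since flushing an empty queue does nothing.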

Thereafter, as described in 103, extremely core words may be determined from the terms corresponding to the commodity title. An extremely core word is a term selected from the terms that meets a predetermined requirement, that is, one of the most central terms reflecting the characteristics of the commodity. If non-Chinese terms have been spliced, the terms include both the original terms obtained by word segmentation (those not spliced with other terms) and the terms obtained by splicing.

Specifically, the extremely core words may be determined in a first mode: de-duplicating the terms, arranging the de-duplicated terms in descending order of weight, and selecting the top K terms after sorting as the extremely core words.

Alternatively, the extremely core words may be determined in a second mode: de-duplicating the terms, and taking as extremely core words those de-duplicated terms that contain a character in a pre-constructed vocabulary, where the vocabulary consists of M independent characters and M is a positive integer greater than one.

Alternatively, the extremely core words may be determined by combining the first and second modes: de-duplicating the terms, arranging the de-duplicated terms in descending order of weight and selecting the top K terms after sorting, determining the de-duplicated terms that contain a character in the vocabulary, merging the top K terms with the terms that contain a character in the vocabulary, de-duplicating the merged result, and taking it as the extremely core words.

In the first mode, the extremely core words are determined according to the commodity title length and the term weights: after the terms are de-duplicated, they are arranged in descending order of weight and the top K terms are selected as the extremely core words. K may be the larger of a predetermined constant and one quarter of the number of de-duplicated terms. The specific value of the predetermined constant can be determined according to actual needs, for example 3. The value of K is therefore not fixed but adapts to the length of the commodity title, which makes it more flexible and accurate.

In the second mode, a vocabulary can be built in advance containing characters that commonly appear in core commodity words, such as "steel", "ball" and "machine". The characters in the vocabulary can be selected through word frequency statistics on the commodity library. If any term contains a character in the vocabulary, that term can be used as an extremely core word.

Determining the extremely core words in this way allows the candidate core commodity words to be determined more accurately later.
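The combined mode can be sketched as follows. The sketch assumes that "one quarter of the number of terms" is rounded down; the function name `extremely_core_words`, the sample terms and the sample vocabulary are hypothetical.

```python
def extremely_core_words(terms, weights, vocab_chars, constant=3):
    """Combine mode one (top K by weight) and mode two (vocabulary characters)."""
    unique = list(dict.fromkeys(terms))              # de-duplicate, keep order
    k = max(constant, len(unique) // 4)              # K adapts to the title length
    ranked = sorted(unique, key=lambda t: weights[t], reverse=True)
    top_k = ranked[:k]                               # mode one
    with_vocab = [t for t in unique                  # mode two
                  if any(ch in vocab_chars for ch in t)]
    return list(dict.fromkeys(top_k + with_vocab))   # merge and de-duplicate

terms = ["包子机", "设备", "厂家", "直销", "全自动", "包子机"]
weights = {"包子机": 0.9, "设备": 0.5, "厂家": 0.1, "直销": 0.05, "全自动": 0.3}
core = extremely_core_words(terms, weights, vocab_chars={"机", "钢", "球"}, constant=2)
# core == ["包子机", "设备"]
```

Using only `top_k` or only `with_vocab` as the return value gives mode one or mode two on its own.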

Accordingly, as described in 104, candidate core commodity words may be determined according to the extremely core words and the obtained weights. Specifically, any P consecutive terms that meet the splicing condition among the terms corresponding to the commodity title may first be spliced, where P is a positive integer greater than one whose specific value can be determined according to actual needs, for example 2 or 3. The splicing condition may include: the spliced term does not contain punctuation marks, the spliced term does not end with a transitive verb, the spliced term does not contain pre-constructed blacklist words, and the spliced term contains an extremely core word. For any spliced term, the weights of the terms it contains may be added, and the sum used as the weight of the spliced term. The spliced terms and the terms not spliced with other terms are then arranged in descending order of weight, and the terms whose weight is greater than a predetermined threshold and which rank in the top Q after sorting are used as candidate core commodity words, where Q is a positive integer greater than one.

For example, assume that the terms corresponding to a commodity title are term a, term b, term c and term d. Taking term b as an example, it can be spliced with term a, or with term c and term d together, and so on, but it cannot be spliced with term d alone, because the terms being spliced must be consecutive.

A spliced term cannot contain punctuation marks, as in "steamed stuffed bun, steamed bread". A spliced term cannot end with a transitive verb, such as "buy" or "sell", which must be followed by a noun or phrase. A spliced term cannot contain pre-constructed blacklist words; which words the blacklist includes can be determined according to actual needs. In addition, a spliced term must contain an extremely core word.

The spliced terms and the terms not spliced with other terms can be arranged in descending order of weight, and the terms whose weight is greater than the predetermined threshold and which rank in the top Q are used as candidate core commodity words. The specific value of Q can be determined according to actual needs, for example 5. The threshold can be obtained by sampling the commodity library: for example, some commodity titles are sampled, each sampled title is segmented, the top 5 terms in descending order of weight are selected, and the average of the weights of the selected terms is used as the threshold.
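Candidate generation can be sketched as follows. The punctuation set, the list of transitive-verb endings and the blacklist are illustrative assumptions, as are the function name `candidate_core_words` and the Chinese sample terms, which are assumed back-translations of the "easy-to-demould resin" example used elsewhere in this description.

```python
import string

def candidate_core_words(terms, weights, core_words, blacklist, verb_endings,
                         p_values=(2, 3), threshold=0.0, q=5):
    """Splice P consecutive terms under the splicing condition, then rank."""
    punct = set(string.punctuation) | set("，。、；：！？（）")
    scored = {t: weights[t] for t in terms}          # terms not spliced
    for p in p_values:
        for i in range(len(terms) - p + 1):
            window = terms[i:i + p]                  # P consecutive terms
            spliced = "".join(window)
            if any(ch in punct for ch in spliced):
                continue                             # contains punctuation
            if window[-1] in verb_endings:
                continue                             # ends with a transitive verb
            if any(b in spliced for b in blacklist):
                continue                             # contains a blacklist word
            if not any(c in window for c in core_words):
                continue                             # lacks an extremely core word
            scored[spliced] = sum(weights[t] for t in window)
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [(t, w) for t, w in ranked[:q] if w > threshold]

terms = ["易", "脱模", "树脂", "厂家"]
weights = {"易": 0.1222, "脱模": 0.0144, "树脂": 1.1447, "厂家": 0.05}
cands = candidate_core_words(terms, weights, core_words={"树脂"},
                             blacklist={"厂家"}, verb_endings={"卖", "买"},
                             threshold=1.0)
# Candidate words, best first: 易脱模树脂, 脱模树脂, 树脂
```

Here the extremely-core-word check is done on whole terms in the window; a substring check on the spliced string would be an equally plausible reading.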

After the candidate core commodity words are obtained, the core commodity word may be determined according to the candidate core commodity words and the predetermined dimension information of the commodity, as described in 105. The predetermined dimension information may include one or any combination of the following: the attribute values of the commodity, the labels of the commodity, and the details of the commodity.

When a commodity is submitted to an e-commerce platform, the user usually fills in information such as commodity parameters, labels and details, and the core commodity word can be determined in combination with this information. Specifically, the following four implementations may be included.

1) Mutual verification of candidate core commodity words and commodity parameters

The core commodity word is determined according to the candidate core commodity words and the attribute values of the commodity.

The commodity parameters include a class of attribute names, such as product name, alias and commodity category, whose corresponding attribute values are often filled in with the core words of the commodity.

As a possible implementation, the candidate core commodity words may be traversed in descending order of weight. When a candidate core commodity word is traversed, the longest common substring between that candidate and the character string formed by all attribute values of the commodity is calculated; if a longest common substring is obtained, the traversal stops and the obtained longest common substring is used as the core commodity word.
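Mode 1) can be sketched as follows, working at the character level. The minimum accepted length `min_len` and the newline separator between attribute values are assumptions added to avoid trivial or cross-value matches; the function names and sample data are hypothetical.

```python
def longest_common_substring(a, b):
    """Character-level longest common substring via dynamic programming."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the common run
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def core_word_from_attributes(candidates, attribute_values, min_len=2):
    """candidates: list of (word, weight); returns a core word or None."""
    attr_string = "\n".join(attribute_values)     # assumed separator
    for word, _ in sorted(candidates, key=lambda c: c[1], reverse=True):
        lcs = longest_common_substring(word, attr_string)
        if len(lcs) >= min_len:
            return lcs                            # stop at the first hit
    return None

candidates = [("易脱模树脂", 1.2813), ("树脂", 1.1447)]
word = core_word_from_attributes(candidates, ["环氧树脂", "透明"])
# word == "树脂"
```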

2) Mutual verification of candidate core commodity words and commodity labels

The core commodity word is determined according to the candidate core commodity words and the labels of the commodity.

As a possible implementation, the longest common substring between each candidate core commodity word and each segmented label is calculated, and the longest common substring with the greatest weight is selected. The edit distance between the selected longest common substring and each candidate core commodity word is then calculated, and the candidate core commodity word with the smallest edit distance is used as the core commodity word. The weight of a longest common substring is the sum of the weights of the terms it contains.

For example, if there are 3 candidate core commodity words and 3 labels, then for each candidate core commodity word the longest common substring between it and each segmented label is calculated, that is, 3 longest common substring calculations per candidate. This yields several longest common substrings, from which the one with the greatest weight is selected. The weight of each longest common substring is the sum of the weights of the terms it contains.

For example, suppose the candidate core commodity word is "easy-to-demould resin", spliced from the terms "easy", "demould" and "resin" obtained by word segmentation, whose weights are 0.1222, 0.0144 and 1.1447 respectively. The weight of the candidate core commodity word "easy-to-demould resin" is then 0.1222 + 0.0144 + 1.1447 = 1.2813. The longest common substring between this candidate and a segmented label "resin manufacturer" is "resin", which contains only the one term "resin", so its weight is 1.1447. If a longest common substring contains two terms, the sum of the weights of the two terms is used as its weight.

After the longest common substring with the greatest weight is obtained, the edit distance between it and each candidate core commodity word is calculated, and the candidate core commodity word with the smallest edit distance is used as the core commodity word. If more than one candidate core commodity word has the smallest edit distance, the one with the greatest weight can be selected as the core commodity word.
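Mode 2) can be sketched as follows. The sketch assumes that the longest common substring is computed over term sequences and that the edit distance is the character-level Levenshtein distance; the function names and the Chinese sample terms (assumed back-translations of the resin example) are hypothetical.

```python
def lcs_terms(a, b):
    """Longest run of consecutive terms shared by term lists a and b."""
    best = []
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best

def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def core_word_from_labels(candidates, labels, weights):
    """candidates: {word: list of its terms}; labels: list of term lists."""
    best_lcs, best_weight = None, float("-inf")
    for cand_terms in candidates.values():
        for label_terms in labels:
            run = lcs_terms(cand_terms, label_terms)
            w = sum(weights[t] for t in run)      # weight = sum of term weights
            if run and w > best_weight:
                best_lcs, best_weight = "".join(run), w
    if best_lcs is None:
        return None

    def cand_weight(word):
        return sum(weights[t] for t in candidates[word])

    # Smallest edit distance first; ties go to the heavier candidate.
    return min(candidates,
               key=lambda word: (edit_distance(best_lcs, word), -cand_weight(word)))

weights = {"易": 0.1222, "脱模": 0.0144, "树脂": 1.1447, "厂家": 0.05}
candidates = {"易脱模树脂": ["易", "脱模", "树脂"], "脱模树脂": ["脱模", "树脂"], "树脂": ["树脂"]}
word = core_word_from_labels(candidates, [["树脂", "厂家"]], weights)
# word == "树脂"
```

Mode 3) has the same shape, with the commodity details treated as a sequence of single characters instead of segmented label terms.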

3) Mutual verification of candidate core commodity words and commodity details

The core commodity word is determined according to the candidate core commodity words and the details of the commodity.

As a possible implementation, the longest common substring between each candidate core commodity word and the details of the commodity is calculated, and the longest common substring with the greatest weight is selected. The edit distance between the selected longest common substring and each candidate core commodity word is then calculated, and the candidate core commodity word with the smallest edit distance is used as the core commodity word. The weight of a longest common substring is the sum of the weights of the terms it contains.

The details of a commodity can be very rich. In the present application, the details are not segmented with a method such as Jieba word segmentation; instead, each character in the details (Chinese or non-Chinese) is treated as a single substring element.

4) Mutual verification of commodity title and commodity label

The core commodity word can be determined according to the commodity title and the labels of the commodity. Generally, this is done when the core commodity word cannot be determined from the candidate core commodity words and the predetermined dimension information of the commodity.

As a possible implementation, the longest common substrings between the commodity title and the character string formed by all labels of the commodity are calculated, and the longest common substring with the greatest weight is used as the core commodity word. The weight of a longest common substring is the sum of the weights of the terms it contains.

In the present application, a dynamic programming algorithm may be used to calculate the longest common substring, as follows:

Taking the calculation of the longest common substring between the commodity title and the label as an example, dp(i, j) can be defined as the length of the longest common substring between the first i terms stra(i) of the commodity title and the first j terms strb(j) of the label, and all substrings of the same length are output. The state transition equation is as follows:

Table 1: Longest common substrings

As shown in Table 1, there are two longest common substrings, "fruit tree" and "pesticide sprayer", each with a length of 1 term. If the weight of "pesticide sprayer" is greater than that of "fruit tree", "pesticide sprayer" can be selected as the core commodity word.
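The dynamic programming step can be sketched as follows. The state transition equation is not reproduced in this text, so the sketch assumes the standard recurrence for the longest common substring: dp(i, j) = dp(i-1, j-1) + 1 when the i-th title term equals the j-th label term, and 0 otherwise, with dp(i, j) read as the length of the common run ending at those two terms. The function name and the Chinese sample terms (assumed back-translations of "fruit tree" and "pesticide sprayer") are hypothetical.

```python
def longest_common_runs(stra, strb):
    """Return every longest common run of consecutive terms (ties included)."""
    dp = [[0] * (len(strb) + 1) for _ in range(len(stra) + 1)]
    best_len, runs = 0, []
    for i in range(1, len(stra) + 1):
        for j in range(1, len(strb) + 1):
            if stra[i - 1] == strb[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # assumed standard recurrence
                run = tuple(stra[i - dp[i][j]:i])
                if dp[i][j] > best_len:
                    best_len, runs = dp[i][j], [run]
                elif dp[i][j] == best_len and run not in runs:
                    runs.append(run)              # keep all runs of equal length
    return runs

title = ["果树", "专用", "打药机"]
label = ["打药机", "厂家", "果树"]
weights = {"果树": 0.4, "专用": 0.1, "打药机": 0.9, "厂家": 0.05}

runs = longest_common_runs(title, label)          # two runs, each one term long
best = max(runs, key=lambda r: sum(weights[t] for t in r))
core_word = "".join(best)                         # "打药机"
```

Keeping all tied runs is what lets the weights decide between them afterwards.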

In addition, in the present application the obtained core commodity words can be restricted, for example by length, and core commodity words that do not meet the length restriction can be filtered out. The specific lengths can be determined according to actual needs: for example, a core commodity word containing only Chinese characters may be limited to 2 to 7 characters, and one containing non-Chinese characters may be limited to 3 to 18 characters. A core commodity word containing non-Chinese characters may take a form such as "U-shaped groove brick machine".

Modes 1) to 4) above need not all be performed; which mode or modes to perform may depend on actual needs. For example, mode 1) and mode 2) may be performed first, and the core commodity word with the greatest weight among those obtained in mode 1) and mode 2) selected as the final core commodity word. If no core commodity word can be obtained in mode 1) and mode 2), mode 3) may be performed; if none can be obtained in mode 3), mode 4) may be performed; and if none can be obtained in mode 4), an empty result may be output. Failure to obtain a core commodity word may have various causes, such as the absence of a longest common substring or the absence of candidate core commodity words.
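The example ordering of the four modes can be sketched as a simple fallback chain. Each mode is passed in as a hypothetical zero-argument callable that returns a (word, weight) pair or None; the function name `select_core_word` is also hypothetical.

```python
def select_core_word(mode1, mode2, mode3, mode4):
    """Run modes 1) and 2) first, then fall back to 3), then 4)."""
    first = [r for r in (mode1(), mode2()) if r is not None]
    if first:
        # Keep the heavier of the words found by modes 1) and 2).
        return max(first, key=lambda r: r[1])[0]
    for fallback in (mode3, mode4):
        result = fallback()
        if result is not None:
            return result[0]
    return ""                                     # nothing found: empty result

word = select_core_word(lambda: ("树脂", 1.1447),
                        lambda: ("脱模树脂", 1.1591),
                        lambda: None,
                        lambda: None)
# word == "脱模树脂"
```

Passing callables means modes 3) and 4) are only evaluated when the earlier modes fail.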

Through the above processing, the candidate core commodity words determined from the commodity title can be checked against dimension information of the commodity other than the title. This effectively guards against problems such as commodity-title cheating and mismatches between the commodity title and the other dimension information, and thereby further improves the accuracy of the mined core commodity words.

After the core commodity words of the commodities are obtained, the store-level commodity words of a store can be determined from the core commodity words of the commodities belonging to that store, for which a sliding-window clustering method can be adopted.

Specifically, for any store, the core commodity words of the commodities belonging to the store may first be de-duplicated. Each de-duplicated core commodity word is then traversed, windows of different predetermined lengths are slid over the core commodity word, the number of occurrences of each character string appearing in the windows is counted, and the character strings whose number of occurrences is greater than a predetermined threshold are used as the store-level commodity words of the store.

The different predetermined lengths may be, for example, 3, 4 and 5 characters. A character string in the window is a string that appears in the sliding window. The predetermined threshold may be, for example, one third of the number of commodities belonging to the store.

When a window slides over a core commodity word, the word segmentation result of the core commodity word can be preserved. For example, the core commodity word "steamed stuffed bun machine equipment" was segmented into "steamed stuffed bun machine" and "equipment" when the commodity title was segmented, so a meaningless character string such as "machine equipment" cannot appear in the window.

For example, if the core commodity words of the commodities include "fruit tree pesticide sprayer", "orchard pesticide sprayer" and so on, the store-level commodity word can be "pesticide sprayer".
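The sliding-window step can be sketched as follows. To preserve the word segmentation result, the sketch only considers windows that start and end on term boundaries and whose character length is one of the predetermined lengths; it also counts each string at most once per core commodity word. Both choices are assumptions, as are the function name and the Chinese sample terms (assumed back-translations of the examples above).

```python
from collections import Counter

def store_level_words(core_words_terms, num_goods, lengths=(3, 4, 5)):
    """core_words_terms: one list of terms per core commodity word of a store."""
    unique = list(dict.fromkeys(tuple(t) for t in core_words_terms))  # de-duplicate
    counts = Counter()
    for terms in unique:
        seen = set()
        for i in range(len(terms)):
            for j in range(i + 1, len(terms) + 1):
                s = "".join(terms[i:j])           # window aligned to term boundaries
                if len(s) in lengths:
                    seen.add(s)
        counts.update(seen)
    threshold = num_goods / 3                     # e.g. one third of the commodities
    return [s for s, c in counts.items() if c > threshold]

core_words = [["果树", "打药机"], ["果园", "打药机"], ["包子机", "设备"]]
words = store_level_words(core_words, num_goods=3)
# words == ["打药机"]
```

Because windows never cut through a term, a string such as "机设备" cannot be produced from ["包子机", "设备"].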

In addition, the number of occurrences of the core commodity word of each commodity belonging to the store may be counted, and the core commodity words whose number of occurrences is greater than a predetermined threshold but which do not contain any store-level commodity word of the store may be determined; such core commodity words may also be used as store-level commodity words of the store. The specific value of the predetermined threshold can be determined according to actual needs.

Through the above processing, store-level commodity words can be obtained in addition to the core commodity words of the commodities, which enriches the mining results and improves mining efficiency.

Based on the above description, FIG. 2 is a schematic diagram of the overall implementation process of the core commodity word mining method described in the present application. For the specific implementation, refer to the related description above, which is not repeated here.

It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

The foregoing is a description of embodiments of the method, and the following further describes embodiments of the device.

FIG. 3 is a schematic structural diagram of an embodiment of the core commodity word mining device 30 described in the present application. As shown in FIG. 3, the device includes: a title word segmentation module 301, a weight acquisition module 302, a candidate determination module 303, and a commodity word determination module 304.

The title word segmentation module 301 is configured to segment a commodity title of a commodity to be processed.

The weight obtaining module 302 is configured to obtain weights of terms obtained by word segmentation respectively.

The candidate determination module 303 is configured to determine extremely core words from the terms, and to determine candidate core commodity words according to the extremely core words and the weights.

The commodity word determining module 304 is configured to determine a core commodity word according to the candidate core commodity word and the predetermined dimension information of the commodity.

The weight obtaining module 302 may determine, for any term, the number DF of the product titles including the term in the product titles of each product in the product library, determine the occurrence number TF of the term in the word segmentation result corresponding to the product title, and further calculateWherein N represents the number of commodities in the commodity library, and the calculated result is normalized and then used as the weight of the term.

The candidate determining module 303 may first splice adjacent non-chinese terms in terms corresponding to the commodity title before determining the polar word from the terms, to obtain a new term, and for any new term, add weights of terms spliced into the new term, and use the sum as the weight of the new term.

Specifically, the candidate determining module 303 may initialize an empty queue and traverse the terms corresponding to the commodity title from front to back. For each traversed term other than the last one, the module determines whether the term is a Chinese term; if so, it splices the terms already in the queue into a new term and reinitializes an empty queue; otherwise, it adds the traversed term to the queue. For the last traversed term, the module determines whether the term is a Chinese term; if so, it splices the terms already in the queue into a new term; otherwise, it adds the traversed term to the queue and then splices the terms in the queue into a new term.
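The queue-based traversal can be sketched as follows. The sketch assumes that a term counts as Chinese when it contains at least one CJK unified ideograph, that splicing is plain concatenation, that an empty queue produces no new term, and that each Chinese term is kept unchanged in the output. The names `is_chinese` and `splice_non_chinese` are hypothetical.

```python
def is_chinese(term):
    # Assumption: a term is Chinese if it contains a CJK unified ideograph.
    return any("\u4e00" <= ch <= "\u9fff" for ch in term)


def splice_non_chinese(terms, weights):
    """Merge runs of adjacent non-Chinese terms; weights is parallel to terms."""
    out_terms, out_weights = [], []
    queue = []  # indices of pending non-Chinese terms

    def flush():
        if queue:  # assumption: an empty queue yields no new term
            out_terms.append("".join(terms[i] for i in queue))
            out_weights.append(sum(weights[i] for i in queue))
            queue.clear()

    for i, term in enumerate(terms):
        if is_chinese(term):
            flush()                    # splice what is queued, then start afresh
            out_terms.append(term)     # assumption: the Chinese term is kept as-is
            out_weights.append(weights[i])
        else:
            queue.append(i)
    flush()  # last term: splice whatever remains in the queue
    return out_terms, out_weights
```

Handling the last term falls out of the final `flush()` call: a Chinese last term flushes the queue before being appended, and a non-Chinese last term is queued and then spliced.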

When determining the polar core words, the candidate determining module 303 may deduplicate the terms corresponding to the commodity title, arrange the deduplicated terms in descending order of weight, and select the terms in the top K positions as the polar core words. Alternatively, the candidate determining module 303 may deduplicate the terms corresponding to the commodity title and take, as the polar core words, those deduplicated terms that contain a character from a pre-constructed vocabulary, where the vocabulary consists of M independent characters and M is a positive integer greater than one. Alternatively, the candidate determining module 303 may deduplicate the terms corresponding to the commodity title, arrange the deduplicated terms in descending order of weight, select the terms in the top K positions, determine the deduplicated terms that contain a character from the vocabulary, and take the deduplicated union of the top-K terms and the terms containing a character from the vocabulary as the polar core words.
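The third alternative, which combines the other two, can be sketched as below. The function name `polar_core_words`, the dictionary of weights, and the example vocabulary are assumptions made for illustration.

```python
def polar_core_words(terms, weights, k, char_vocab):
    """Select polar core words from the segmented terms of one title.

    terms: segmented terms, possibly with repeats.
    weights: dict mapping each term to its weight.
    char_vocab: set of single characters (the pre-constructed vocabulary).
    """
    unique = list(dict.fromkeys(terms))                 # deduplicate, keep order
    ranked = sorted(unique, key=lambda t: weights[t], reverse=True)
    top_k = ranked[:k]                                  # top K by weight
    with_vocab_char = [t for t in unique
                       if any(ch in char_vocab for ch in t)]
    # Merge both selections and deduplicate again
    return list(dict.fromkeys(top_k + with_vocab_char))
```

Passing an empty `char_vocab` reduces this to the first alternative, and passing `k=0` reduces it to the second.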

The candidate determining module 303 may further splice any P consecutive terms, among the terms corresponding to the commodity title, that meet the splicing conditions, where P is a positive integer greater than one. The splicing conditions include: the resulting term contains no punctuation marks, the resulting term does not end in a verb-object construction, the resulting term contains no pre-constructed blacklisted words, and the resulting term contains a polar core word. For any spliced term, the weights of the terms it contains are added, and the sum is used as the weight of the spliced term. The spliced terms and the terms that were not spliced with other terms are then arranged in descending order of weight, and the terms whose weight is greater than a predetermined threshold and that fall within the top Q positions are taken as the candidate core commodity words, where Q is a positive integer greater than one.
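A minimal sketch of the candidate generation follows, under several assumptions: a single window size `p` is used, splicing is plain concatenation, the punctuation set is a small illustrative one, and the verb-object check is delegated to a hypothetical hook `ends_with_verb_object` because a real check would need part-of-speech information that the text does not specify.

```python
import string

# Assumed punctuation set: ASCII punctuation plus a few full-width marks
PUNCT = set(string.punctuation) | set(",。!?、;:()【】《》")


def candidate_core_words(terms, weights, polar, blacklist, p=2, threshold=0.0,
                         q=3, ends_with_verb_object=lambda text: False):
    """terms and weights are parallel lists; polar and blacklist hold strings."""
    scored = {}
    used = set()                                   # indices that were spliced
    for start in range(len(terms) - p + 1):
        text = "".join(terms[start:start + p])
        ok = (not any(ch in PUNCT for ch in text)
              and not ends_with_verb_object(text)  # hypothetical hook
              and not any(b in text for b in blacklist)
              and any(pc in text for pc in polar))
        if ok:
            scored[text] = sum(weights[start:start + p])
            used.update(range(start, start + p))
    for i, term in enumerate(terms):
        if i not in used:                          # terms not spliced with others
            scored.setdefault(term, weights[i])
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [(t, w) for t, w in ranked if w > threshold][:q]
```

The threshold is applied before the top-Q cut, so fewer than Q candidates may be returned when few terms carry enough weight.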

In addition, the predetermined dimension information may include one or any combination of the following: the attribute value of the commodity, the label of the commodity and the detail of the commodity.

The commodity word determining module 304 may traverse each candidate core commodity word arranged in descending order of weight, when traversing any candidate core commodity word, calculate the longest common substring between the candidate core commodity word and the character string formed by all attribute values of the commodity, and if the longest common substring can be obtained, stop traversing, and use the obtained longest common substring as the core commodity word.
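The attribute-based determination can be sketched with a standard dynamic-programming longest common substring routine. The sketch assumes that the attribute values are concatenated directly into one string and that a substring "can be obtained" when its length reaches an assumed cut-off `min_len`; both function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b ('' if none)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1           # extend the diagonal match
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def core_word_from_attributes(candidates, attribute_values, min_len=1):
    """candidates: candidate words already sorted by descending weight."""
    attr_string = "".join(attribute_values)        # assumption: plain concatenation
    for cand in candidates:
        lcs = longest_common_substring(cand, attr_string)
        if len(lcs) >= min_len:                    # min_len is an assumed cut-off
            return lcs                             # stop at the first hit
    return None
```

Because the candidates are visited in descending order of weight, the first hit comes from the highest-weighted candidate that overlaps the attribute values.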

The commodity word determining module 304 may further calculate the longest common substring between each candidate core commodity word and each segmented label, select the longest common substring with the greatest weight, calculate the edit distance between the selected longest common substring and each candidate core commodity word, and use the candidate core commodity word with the smallest edit distance as the core commodity word, where the weight of the longest common substring is the sum of the weights of the terms contained therein.
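The label-based determination combines a longest common substring with an edit distance. The sketch below assumes that a term is "contained" in a substring when it occurs in it as a substring, and uses the Levenshtein distance as the edit distance; the function names and the `term_weights` dictionary are illustrative assumptions.

```python
def lcs(a, b):
    """Longest contiguous substring shared by a and b ('' if none)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def edit_distance(a, b):
    """Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]


def core_word_from_labels(candidates, label_terms, term_weights):
    """Pick the candidate closest to the highest-weighted common substring."""
    best, best_w = "", -1.0
    for cand in candidates:
        for label in label_terms:
            s = lcs(cand, label)
            if not s:
                continue
            # Assumption: a term is contained if it occurs inside the substring
            w = sum(wt for t, wt in term_weights.items() if t in s)
            if w > best_w:
                best, best_w = s, w
    if not best:
        return None
    return min(candidates, key=lambda c: edit_distance(best, c))
```

The same routine would apply to the commodity details by passing the detail text in place of the segmented labels.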

The commodity word determining module 304 may further calculate the longest common substring between each candidate core commodity word and the detail of the commodity, select the longest common substring with the largest weight, calculate the edit distance between the selected longest common substring and each candidate core commodity word, and use the candidate core commodity word with the smallest edit distance as the core commodity word, where the weight of the longest common substring is the sum of the weights of the terms included in the longest common substring.

The commodity word determining module 304 may further determine the core commodity word according to the commodity title of the commodity and the label of the commodity when the core commodity word cannot be determined according to the candidate core commodity word and the predetermined dimension information of the commodity.

Specifically, the commodity word determining module 304 may calculate a longest common substring between the commodity title of the commodity and the character strings composed of all the labels of the commodity, and take the longest common substring with the largest weight as the core commodity word, where the weight of the longest common substring is the sum of the weights of the terms contained therein.
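One possible reading of this fallback is sketched below: every substring of the title that also occurs in the concatenated labels is scored by the summed weights of the terms it contains, and the highest-scoring one is returned, with ties broken in favor of the longer substring. The direct concatenation of labels, the minimum length `min_len`, and the function name are assumptions.

```python
def weighted_common_substring(title, labels, term_weights, min_len=2):
    """Return the common substring of title and labels with the largest weight."""
    label_string = "".join(labels)        # assumption: labels concatenated directly
    best, best_key = None, (0.0, 0)
    for i in range(len(title)):
        for j in range(i + min_len, len(title) + 1):
            sub = title[i:j]
            if sub not in label_string:
                break                     # longer substrings from i cannot match
            # Weight: sum of the weights of the terms contained in the substring
            w = sum(wt for t, wt in term_weights.items() if t in sub)
            if (w, len(sub)) > best_key:
                best, best_key = sub, (w, len(sub))
    return best
```

The early `break` is safe because any extension of a substring that is absent from the labels is also absent.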

The commodity word determining module 304 may also determine a store-level commodity word of the same store according to the core commodity word of each commodity corresponding to the same store.

Specifically, the commodity word determining module 304 may perform deduplication on the core commodity word of each commodity corresponding to the store, traverse each core commodity word after deduplication, slide on the core commodity word by using windows with different predetermined lengths, count the occurrence times of the character strings in the windows, and use the character string with the occurrence times greater than a predetermined threshold as the store-level commodity word of the store.

The commodity word determining module 304 may also count the number of occurrences of the core commodity word of each commodity corresponding to the store, and additionally take, as store-level commodity words of the store, the core commodity words whose number of occurrences is greater than a predetermined threshold and which do not contain any character string already determined to be a store-level commodity word of the store.
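The two store-level steps can be sketched together. The window lengths and both thresholds are illustrative parameters, and the sketch assumes that whole-word occurrences are counted over all commodities of the store, without deduplication; the function name is hypothetical.

```python
from collections import Counter


def store_level_words(core_words, window_lengths=(2, 3),
                      ngram_threshold=1, word_threshold=1):
    """core_words: one core commodity word per commodity of the same store."""
    # Step 1: slide windows over the deduplicated core commodity words
    ngram_counts = Counter()
    for word in set(core_words):
        for n in window_lengths:
            for i in range(len(word) - n + 1):
                ngram_counts[word[i:i + n]] += 1
    base = {s for s, c in ngram_counts.items() if c > ngram_threshold}

    # Step 2: add frequent whole words that contain none of the strings above
    word_counts = Counter(core_words)     # assumption: counted with repeats
    extra = {w for w, c in word_counts.items()
             if c > word_threshold and not any(s in w for s in base)}
    return base | extra
```

The second step recovers words such as a product sold many times under one identical core commodity word, which the window counts over deduplicated words would otherwise miss.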

For the specific workflow of the device embodiment shown in Fig. 3, refer to the related description in the foregoing method embodiments; it is not repeated here.

In summary, with the solution of the device embodiment of the present application, candidate core commodity words can be determined by operations such as word segmentation of the commodity title, and the core commodity words can finally be determined in combination with the predetermined dimension information of the commodity. In addition, store-level commodity words can be further obtained on the basis of the core commodity words of the commodities, which enriches the mining results and improves mining efficiency.

According to embodiments of the present application, an electronic device and a readable storage medium are also provided.

Fig. 4 is a block diagram of an electronic device for implementing the method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.

As shown in Fig. 4, the electronic device includes: one or more processors Y01, a memory Y02, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a graphical user interface on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In Fig. 4, one processor Y01 is taken as an example.

The memory Y02 is a non-transitory computer readable storage medium provided in the present application. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.

The memory Y02 serves as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present application. The processor Y01 executes various functional applications of the server and data processing, i.e., implements the methods in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory Y02.

The memory Y02 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory Y02 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory Y02 may optionally include memory located remotely from the processor Y01, and such remote memory may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.

The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03, and the output device Y04 may be connected by a bus or in other manners; connection by a bus is taken as an example in Fig. 4.

The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and similar input devices. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device may be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. The terms "machine-readable medium" and "computer-readable medium" as used herein refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices) for providing machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS services.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. A core commodity word mining method, comprising:

performing word segmentation on the commodity title of a commodity to be processed;

obtaining a weight for each term obtained by the word segmentation;

determining polar core words from the terms;

determining candidate core commodity words according to the polar core words and the weights;

determining core commodity words according to the candidate core commodity words and the preset dimension information of the commodity;

the method further comprising: before determining the polar core words from the terms, splicing adjacent non-Chinese terms among the terms corresponding to the commodity title to obtain new terms; and, for any new term, adding the weights of the terms spliced into the new term and taking the sum as the weight of the new term;

wherein the splicing of adjacent non-Chinese terms among the terms corresponding to the commodity title to obtain new terms comprises: initializing an empty queue; traversing the terms corresponding to the commodity title from front to back; for each traversed term other than the last one, performing the following processing: determining whether the traversed term is a Chinese term, and if so, splicing the terms already in the queue into a new term and reinitializing an empty queue, otherwise adding the traversed term to the queue; and for the last traversed term, performing the following processing: determining whether the traversed term is a Chinese term, and if so, splicing the terms already in the queue into a new term, otherwise adding the traversed term to the queue and splicing the terms in the queue into a new term.

2. The method of claim 1, wherein the separately obtaining weights of terms obtained by word segmentation includes:

for any term, determining the number DF of commodity titles in the commodity library that contain the term;

determining the number of occurrences TF of the term in the word segmentation result corresponding to the commodity title;

calculating a score from TF, DF and N [formula not reproduced in this text], wherein N represents the number of commodities in the commodity library;

and normalizing the calculation result and using the normalized result as the weight of the term.

3. The method of claim 1, wherein the determining polar core words from the terms comprises:

deduplicating the terms corresponding to the commodity title, arranging the deduplicated terms in descending order of weight, and selecting the terms in the top K positions as the polar core words;

or, deduplicating the terms corresponding to the commodity title, and taking, as the polar core words, the deduplicated terms that contain a character from a pre-constructed vocabulary, wherein the vocabulary consists of M independent characters, and M is a positive integer greater than one;

or, deduplicating the terms corresponding to the commodity title, arranging the deduplicated terms in descending order of weight, selecting the terms in the top K positions, determining the deduplicated terms that contain a character from the vocabulary, and taking the deduplicated union of the top-K terms and the terms containing a character from the vocabulary as the polar core words.

4. The method of claim 1, wherein the determining candidate core commodity words according to the polar core words and the weights comprises:

splicing any P consecutive terms, among the terms corresponding to the commodity title, that meet splicing conditions, wherein P is a positive integer greater than one, and the splicing conditions comprise: the resulting term contains no punctuation marks, the resulting term does not end in a verb-object construction, the resulting term contains no pre-constructed blacklisted words, and the resulting term contains a polar core word;

for any spliced term, adding the weights of the terms contained in the spliced term, and taking the sum as the weight of the spliced term;

arranging the spliced terms and the terms not spliced with other terms in descending order of weight, and taking the terms whose weight is greater than a predetermined threshold and that fall within the top Q positions as the candidate core commodity words, wherein Q is a positive integer greater than one.

5. The method of claim 1, wherein the predetermined dimension information comprises one or any combination of: the attribute value of the commodity, the label of the commodity and the detail of the commodity.

6. The method of claim 5, wherein determining core commodity words from the candidate core commodity words and the attribute values of the commodity comprises:

traversing the candidate core commodity words arranged in descending order of weight; when any candidate core commodity word is traversed, calculating the longest common substring between the candidate core commodity word and a character string formed by all attribute values of the commodity; and, if a longest common substring can be obtained, stopping the traversal and taking the obtained longest common substring as the core commodity word.

7. The method of claim 5, wherein determining the core commodity word from the candidate core commodity word and the label of the commodity comprises:

calculating the longest common substring between each candidate core commodity word and each segmented label, selecting the longest common substring with the largest weight, calculating the edit distance between the selected longest common substring and each candidate core commodity word, and taking the candidate core commodity word with the smallest edit distance as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

8. The method of claim 5, wherein determining the core commodity word from the candidate core commodity words and the details of the commodity comprises:

calculating the longest common substring between each candidate core commodity word and the details of the commodity, selecting the longest common substring with the largest weight, calculating the edit distance between the selected longest common substring and each candidate core commodity word, and taking the candidate core commodity word with the smallest edit distance as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

9. The method of claim 5, further comprising:

if the core commodity word cannot be determined according to the candidate core commodity word and the preset dimension information of the commodity, determining the core commodity word according to the commodity title of the commodity and the label of the commodity.

10. The method of claim 9, wherein the determining the core commodity word according to the commodity title of the commodity and the label of the commodity comprises:

calculating the longest common substring between the commodity title of the commodity and a character string formed by all the labels of the commodity, and taking the longest common substring with the largest weight as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

11. The method of claim 1, further comprising: determining store-level commodity words of a store according to the core commodity words of the commodities corresponding to the same store.

12. The method of claim 11, wherein the determining the store-level commodity word of the store according to the core commodity word of each commodity corresponding to the same store comprises:

deduplicating the core commodity words of the commodities corresponding to the store;

traversing each core commodity word after duplication removal, utilizing windows with different preset lengths to slide on the core commodity word, counting the occurrence times of character strings in the windows, and taking the character strings with the occurrence times larger than a preset threshold value as store-level commodity words of the stores.

13. The method of claim 12, further comprising:

counting the number of occurrences of the core commodity words of the commodities corresponding to the store;

and additionally taking, as store-level commodity words of the store, the core commodity words whose number of occurrences is greater than a predetermined threshold and which do not contain any character string already determined to be a store-level commodity word of the store.

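14. A core commodity word mining device, comprising: a title word segmentation module, a weight obtaining module, a candidate determining module and a commodity word determining module;

the title word segmentation module is configured to perform word segmentation on the commodity title of a commodity to be processed;

the weight obtaining module is configured to obtain a weight for each term obtained by the word segmentation;

the candidate determining module is configured to determine polar core words from the terms, and to determine candidate core commodity words according to the polar core words and the weights;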

the commodity word determining module is used for determining core commodity words according to the candidate core commodity words and the preset dimension information of the commodity;

the candidate determining module is further configured to, before determining the polar core words from the terms, splice adjacent non-Chinese terms among the terms corresponding to the commodity title to obtain new terms, and, for any new term, add the weights of the terms spliced into the new term and use the sum as the weight of the new term; wherein the splicing of adjacent non-Chinese terms among the terms corresponding to the commodity title to obtain new terms comprises: initializing an empty queue; traversing the terms corresponding to the commodity title from front to back; for each traversed term other than the last one, performing the following processing: determining whether the traversed term is a Chinese term, and if so, splicing the terms already in the queue into a new term and reinitializing an empty queue, otherwise adding the traversed term to the queue; and for the last traversed term, performing the following processing: determining whether the traversed term is a Chinese term, and if so, splicing the terms already in the queue into a new term, otherwise adding the traversed term to the queue and splicing the terms in the queue into a new term.

15. The apparatus of claim 14, wherein the weight obtaining module, for any term, determines the number DF of commodity titles in the commodity library that contain the term, determines the number of occurrences TF of the term in the word segmentation result corresponding to the commodity title, and calculates a score from TF, DF and N [formula not reproduced in this text], wherein N represents the number of commodities in the commodity library, and the calculated result is normalized and then used as the weight of the term.

16. The apparatus of claim 14, wherein,

the candidate determining module deduplicates the terms corresponding to the commodity title, arranges the deduplicated terms in descending order of weight, and selects the terms in the top K positions as the polar core words;

or the candidate determining module deduplicates the terms corresponding to the commodity title, and takes, as the polar core words, the deduplicated terms that contain a character from a pre-constructed vocabulary, wherein the vocabulary consists of M independent characters, and M is a positive integer greater than one;

or the candidate determining module deduplicates the terms corresponding to the commodity title, arranges the deduplicated terms in descending order of weight, selects the terms in the top K positions, determines the deduplicated terms that contain a character from the vocabulary, and takes the deduplicated union of the top-K terms and the terms containing a character from the vocabulary as the polar core words.

17. The apparatus of claim 14, wherein the candidate determining module splices any P consecutive terms, among the terms corresponding to the commodity title, that meet splicing conditions, P being a positive integer greater than one, the splicing conditions comprising: the resulting term contains no punctuation marks, the resulting term does not end in a verb-object construction, the resulting term contains no pre-constructed blacklisted words, and the resulting term contains a polar core word; for any spliced term, adds the weights of the terms contained in the spliced term and takes the sum as the weight of the spliced term; and arranges the spliced terms and the terms not spliced with other terms in descending order of weight, and takes the terms whose weight is greater than a predetermined threshold and that fall within the top Q positions as the candidate core commodity words, wherein Q is a positive integer greater than one.

18. The apparatus of claim 14, wherein the predetermined dimensional information comprises one or any combination of: the attribute value of the commodity, the label of the commodity and the detail of the commodity.

19. The apparatus of claim 18, wherein the commodity word determining module traverses each candidate core commodity word arranged in descending order of weight, calculates a longest common sub-string between the candidate core commodity word and a character string composed of all attribute values of the commodity when traversing any candidate core commodity word, and stops traversing if the longest common sub-string can be obtained, and uses the obtained longest common sub-string as the core commodity word.

20. The apparatus of claim 18, wherein the commodity word determining module calculates the longest common substring between each candidate core commodity word and each segmented label, selects the longest common substring with the largest weight, calculates the edit distance between the selected longest common substring and each candidate core commodity word, and uses the candidate core commodity word with the smallest edit distance as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

21. The apparatus of claim 18, wherein the commodity word determining module calculates the longest common substring between each candidate core commodity word and the details of the commodity, selects the longest common substring with the largest weight, calculates the edit distance between the selected longest common substring and each candidate core commodity word, and uses the candidate core commodity word with the smallest edit distance as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

22. The apparatus of claim 18, wherein the commodity word determining module is further configured to determine the core commodity word according to a commodity title of the commodity and a label of the commodity if the core commodity word cannot be determined according to the candidate core commodity word and the predetermined dimension information of the commodity.

23. The apparatus of claim 22, wherein the commodity word determining module calculates the longest common substring between the commodity title of the commodity and a character string formed by all the labels of the commodity, and takes the longest common substring with the largest weight as the core commodity word, wherein the weight of a longest common substring is the sum of the weights of the terms contained therein.

24. The apparatus of claim 14, wherein the commodity word determining module is further configured to determine a store-level commodity word of the store based on core commodity words of respective commodities corresponding to a same store.

25. The apparatus of claim 24, wherein the commodity word determining module performs deduplication on core commodity words of each commodity corresponding to the store, traverses each core commodity word after deduplication, slides over the core commodity words by using windows of different predetermined lengths, counts the occurrence times of character strings in the windows, and uses the character strings with the occurrence times greater than a predetermined threshold as store-level commodity words of the store.

26. The apparatus of claim 25, wherein the commodity word determining module is further configured to count the number of occurrences of the core commodity words of the commodities corresponding to the store, and to additionally take, as store-level commodity words of the store, the core commodity words whose number of occurrences is greater than a predetermined threshold and which do not contain any character string already determined to be a store-level commodity word of the store.

27. An electronic device, comprising:

at least one processor; and

a memory connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.

28. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.