CN111325025B - Shop name mining method and device - Google Patents

Info

Publication number
CN111325025B
Authority
CN
China
Prior art keywords
word
core
segmentation
feature
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010078834.XA
Other languages
Chinese (zh)
Other versions
CN111325025A (en)
Inventor
李向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koukouxiangchuan Beijing Network Technology Co ltd
Original Assignee
Koukouxiangchuan Beijing Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koukouxiangchuan Beijing Network Technology Co ltd
Priority to CN202010078834.XA
Publication of CN111325025A
Application granted
Publication of CN111325025B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a shop name mining method and device. The method comprises the following steps: collecting shop names, and generating a core phrase set and a feature phrase set according to the collected shop names; performing word segmentation on any phrase in the feature phrase set and the core phrase set, and calculating, for any word segmentation result, the probability that it serves as a core word and the probability that it serves as a feature word; and obtaining a target shop name to be processed, performing word segmentation on the target shop name, and determining at least one segmentation point, wherein the words before a segmentation point are taken as the core word and the words after it as the feature word, calculating the segmentation probability corresponding to each segmentation point according to the core word probabilities and the feature word probabilities, and determining the target segmentation point of the target shop name according to the segmentation probabilities. This improves mining accuracy, saves mining time, and simplifies the processing flow, and it solves the problem in the prior art that mining level by level with the N-gram method requires repeated, time-consuming operations.

Description

Shop name mining method and device
Technical Field
The invention relates to the technical field of internet, in particular to a shop name mining method and device.
Background
A store name is generally divided into two parts: one part is a name that is relatively unique to the store (called the core word), and the other part describes the category of the store, its main dishes, or the like (called the feature word). In general, the core word is located at the front of the store name and the feature word at the rear.
The existing method for mining store names mines the feature words, mainly by combining the N-gram method with manual review. The N-gram method is first used for word segmentation, where the value of N is related to the length of the store name. For example, if the length of the store name is 5, N takes the values 1, 2, 3, 4 and 5 in turn. Specifically, suffixes of one word are extracted first, the high-frequency ones are counted, and the results are reviewed manually; the same operation is then repeated for suffixes of two words, three words, and so on, five times in total. As a result, both the efficiency and the accuracy of store name mining are low.
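The level-by-level procedure described above can be pictured with a minimal sketch. It assumes store names are already split into tokens; the function name `suffix_ngram_counts` and the sample names are hypothetical, and the manual review step is not modeled.

```python
from collections import Counter

def suffix_ngram_counts(store_names, n):
    """Count how often each n-token suffix occurs across tokenized store names."""
    counts = Counter()
    for tokens in store_names:
        if len(tokens) >= n:
            counts[tuple(tokens[-n:])] += 1
    return counts

# Hypothetical, pre-tokenized store names (illustrative only).
names = [
    ["kendyr", "home", "quick", "delivery"],
    ["mcdonald", "home", "quick", "delivery"],
    ["mcdonald"],
]

# One full pass per suffix length N = 1..max length: this is the
# repeated operation that the background section criticizes.
max_len = max(len(tokens) for tokens in names)
per_level = {n: suffix_ngram_counts(names, n) for n in range(1, max_len + 1)}
```

Each value of N needs its own counting pass (and, in the prior art, its own review), which is where the repeated work comes from.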
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a shop name mining method and apparatus that overcome or at least partially solve the above problems.
According to an aspect of an embodiment of the present invention, there is provided a shop name mining method including:
collecting shop names, and generating a core phrase set and a feature phrase set according to the collected shop names;
performing word segmentation on any phrase in the feature phrase set and the core phrase set, and calculating the probability of any word segmentation result as a core word and the probability of any word segmentation result as a feature word;
obtaining a target shop name to be processed, performing word segmentation on the target shop name, and determining at least one segmentation point, wherein the words before a segmentation point are taken as the core word and the words after it as the feature word; calculating the segmentation probability corresponding to each segmentation point according to the core word probabilities and the feature word probabilities; and determining the target segmentation point of the target shop name according to the segmentation probabilities.
Optionally, collecting the store name, and generating the core phrase set and the feature phrase set according to the collected store name further comprises:
S1, obtaining a plurality of shop names, and screening out the shop names whose length is less than or equal to a preset word length as core phrases;
S2, matching the shop names using the core phrases, and recording the unmatched parts of the shop names into a feature phrase set as feature phrases;
S3, matching the shop names using the feature phrases, and recording the unmatched parts of the shop names into a core phrase set as core phrases; and iteratively performing S2 to S3 to obtain the feature phrase set and the core phrase set.
Optionally, calculating a segmentation probability corresponding to each segmentation point according to the core word probability and the feature word probability, and determining a target segmentation point of the target store name according to the segmentation probability further includes:
for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability;
and determining the segmentation point corresponding to the maximum segmentation probability as a target segmentation point of the target shop name.
Optionally, segmenting any phrase in the feature phrase set and the core phrase set, and calculating a probability that any segmentation result is used as a core word and a probability that any segmentation result is used as a feature word further includes:
performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set;
and calculating the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word according to the first word frequency and the second word frequency.
Optionally, after generating the feature phrase set, the method further comprises: counting the occurrence frequency of any feature phrase in the feature phrase set, and taking the feature phrases whose occurrence frequency is greater than or equal to a preset word frequency as category words or dish words of the shop.
Optionally, the method further comprises: obtaining a shop search word, matching the shop search word against the core words of shop names, and screening out and pushing the shop names whose core words match the shop search word.
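As an illustration of the optional search step, the sketch below filters shops by their mined core word. It assumes that "matching" means exact equality and that each shop is stored as a (shop name, core word) pair; both are assumptions, as are the function name `match_shops` and the sample data.

```python
def match_shops(search_word, shops):
    """Return the names of shops whose mined core word equals the search word.

    `shops` is a list of (shop_name, core_word) pairs. Exact equality is an
    assumption; the source does not specify the matching rule.
    """
    return [name for name, core_word in shops if core_word == search_word]

# Hypothetical mined results (illustrative only).
shops = [
    ("mcdonald home quick delivery", "mcdonald"),
    ("kendyr home quick delivery", "kendyr"),
    ("mcdonald", "mcdonald"),
]
matched = match_shops("mcdonald", shops)
```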
According to another aspect of an embodiment of the present invention, there is provided a shop name mining apparatus including:
the generating module is suitable for collecting shop names and generating a core phrase set and a feature phrase set according to the collected shop names;
the calculation module is suitable for segmenting any phrase in the feature phrase set and the core phrase set, and calculating the probability of any segmentation result as a core word and the probability of any segmentation result as a feature word;
the word segmentation module is suitable for acquiring a target store name to be processed, segmenting the target store name and determining at least one segmentation point, wherein words before the segmentation point are used as core words, and words after the segmentation point are used as feature words;
and the determining module is suitable for calculating the segmentation probability corresponding to each segmentation point according to the core word probability and the characteristic word probability and determining the target segmentation point of the target shop name according to the segmentation probability.
Optionally, the generating module is further adapted to: S1, obtaining a plurality of shop names, and screening out the shop names whose length is less than or equal to a preset word length as core phrases;
S2, matching the shop names using the core phrases, and recording the unmatched parts of the shop names into a feature phrase set as feature phrases;
S3, matching the shop names using the feature phrases, and recording the unmatched parts of the shop names into a core phrase set as core phrases; and iteratively performing S2 to S3 to obtain the feature phrase set and the core phrase set.
Optionally, the determining module is further adapted to: for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability;
and determining the segmentation point corresponding to the maximum segmentation probability as a target segmentation point of the target shop name.
Optionally, the calculation module is further adapted to: performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set;
and calculating the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word according to the first word frequency and the second word frequency.
Optionally, the apparatus further comprises: a processing module adapted to count the occurrence frequency of any feature phrase in the feature phrase set, and to take the feature phrases whose occurrence frequency is greater than or equal to a preset word frequency as category words or dish words of the shop.
Optionally, the apparatus further comprises: a pushing module adapted to obtain a shop search word, match the shop search word against the core words of shop names, and screen out and push the shop names whose core words match the shop search word.
According to still another aspect of an embodiment of the present invention, there is provided a computing device including: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the shop name mining method.
According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform an operation corresponding to the store name mining method.
According to the scheme provided by the embodiment of the invention, store names are mined using the generated core phrase set and feature phrase set, which improves mining accuracy and allows the core words and feature words of store names to be mined automatically. This solves the problem that mining level by level with the N-gram method requires repeated, time-consuming operations, and it simplifies the processing flow: level-by-level N-gram mining (which amounts to multiple problems, one for each value of N) is converted into a single segmentation problem, which saves mining time.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features, and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a store name mining method according to an embodiment of the invention;
FIG. 2 is a flowchart of a store name mining method according to another embodiment of the invention;
FIG. 3 is a schematic structural diagram of a store name mining apparatus according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 is a flowchart illustrating a store name mining method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
Step S101, store names are collected, and a core phrase set and a feature phrase set are generated according to the collected store names.
To mine store names accurately, a large number of store names need to be collected; a core phrase set and a feature phrase set are then generated from the collected store names. The core phrase set contains a plurality of core phrases, and the feature phrase set contains a plurality of feature phrases.
Step S102, performing word segmentation on any phrase in the feature phrase set and the core phrase set, and calculating the probability of any word segmentation result as a core word and the probability of any word segmentation result as a feature word.
After the core phrase set and the feature phrase set are generated in step S101, a phrase is arbitrarily selected from the core phrase set and segmented; for example, a word segmentation method based on character string matching, on understanding, or on statistics may be used. Word segmentation yields at least one word segmentation result, and for any word segmentation result, the probability that it serves as a core word and the probability that it serves as a feature word need to be calculated.
Similarly, a phrase is arbitrarily selected from the feature phrase set, and the phrase is subjected to word segmentation processing, for example, a word segmentation method based on character string matching, or a word segmentation method based on understanding, or a word segmentation method based on statistics, etc. may be adopted to perform word segmentation processing on the phrase, and a person skilled in the art can flexibly select a word segmentation method according to needs, or comprehensively use multiple methods to perform word segmentation processing, and certainly, the method is not limited to the enumerated word segmentation method, and other word segmentation methods may also be used. After word segmentation processing, at least one word segmentation result is obtained, and for any word segmentation result, the probability that the word segmentation result is used as a core word and the probability that the word segmentation result is used as a feature word need to be calculated.
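As one possible instance of a word segmentation method based on character string matching, the sketch below uses forward maximum matching against a lexicon. The source does not prescribe this particular algorithm; the function name `forward_max_match`, the lexicon, and the placeholder input are all hypothetical.

```python
def forward_max_match(text, lexicon, max_word_len=4):
    """Greedy dictionary segmentation: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Placeholder lexicon and input; real use would rely on a store-name lexicon.
lexicon = {"ab", "cde"}
tokens = forward_max_match("abcdex", lexicon)
```

Characters that are not covered by any lexicon entry fall through as single-character tokens, so the function always terminates and always covers the whole input.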
Step S103, obtaining a target shop name to be processed, segmenting the target shop name, and determining at least one segmentation point, wherein the words before the segmentation point are used as core words, the words after the segmentation point are used as feature words, the segmentation probability corresponding to each segmentation point is calculated according to the core word probability and the feature word probability, and the target segmentation point of the target shop name is determined according to the segmentation probability.
When store name mining is required, a target store name to be processed is obtained and then segmented; the word segmentation method may be the same as or different from the one used in step S102. Word segmentation divides the target store name into single characters or words composed of two or more characters, and thereby determines at least one segmentation point. Specifically, the words before a segmentation point are taken as the core word and the words after it as the feature word; the segmentation probability corresponding to each segmentation point is then calculated according to the core word probabilities and the feature word probabilities, and the target segmentation point of the target store name is determined according to the segmentation probabilities. The target segmentation point is the best segmentation point, so that the mining of the store name is most reasonable.
Take the target store name "Kang Liande hot spring house" as an example. After word segmentation, the results are "kang", "Lian", "De" and "hot spring house", so four segmentation points can be determined. One segmentation point is between "kang" and "Lian", in which case "kang" is the core word and "Lian De hot spring house" is the feature word; one is between "Lian" and "De", in which case "Kang Lian" is the core word and "De hot spring house" is the feature word; one is between "De" and "hot spring house", in which case "Kang Liande" is the core word and "hot spring house" is the feature word; and one is after "hot spring house", in which case "Kang Liande hot spring house" is the core word and there is no feature word. The segmentation probability corresponding to each segmentation point is then calculated according to the core word probabilities and the feature word probabilities, and the target segmentation point of the target store name is determined according to the segmentation probabilities.
According to the method provided by the embodiment of the invention, store names are mined using the generated core phrase set and feature phrase set, and the association between core words and feature words is taken into account, which improves mining accuracy and allows the core words and feature words of store names to be mined automatically. This solves the problem in the prior art that mining level by level with the N-gram method requires repeated, time-consuming operations, and it simplifies the processing flow: level-by-level N-gram mining (which amounts to multiple problems, one for each value of N) is converted into a single segmentation problem. Mining time and labor cost are saved, and no manual review is needed.
Fig. 2 is a flowchart illustrating a method for mining a store name according to another embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:
Step S201, a plurality of store names are obtained, and the store names whose length is less than or equal to a preset word length are screened out as core phrases.
Different stores have different store names, and a store name is mainly divided into two parts: one part is a relatively unique name that describes the store itself and may be called the core word; the other part describes the category of the store, its signature dishes, or the like, and may be called the feature word.
Analysis of store names shows that some store names are no longer than a preset word length, and such store names are words that describe the stores themselves. Therefore, after a plurality of store names are obtained, they can be screened: the store names whose length is less than or equal to the preset word length are screened out and used as core phrases in the subsequent steps. The preset word length may be, for example, 3 words; this value is illustrative only and is not limiting.
Step S202, the core phrases are used for matching the shop names, and the unmatched shop name parts are recorded into the feature phrase set as feature phrases.
Considering that a common store name is composed of a core word and a feature word, and the core words of some store names are the same, after the core phrases are screened out, the screened out core phrases can be used for matching with other store names, if some words in the store name match with the core phrases, the unmatched store name part can be extracted from the store name, and the unmatched store name part is recorded into the feature phrase set as the feature phrase.
For example, suppose the core phrase screened out in step S201 is "kendyr" and a store name is "kendyr home quick delivery". Matching the core phrase against the store name matches "kendyr"; the unmatched part, "home quick delivery", can be recorded in the feature phrase set as a feature phrase.
Step S203, matching the store names using the feature phrases, and recording the unmatched parts of the store names into the core phrase set as core phrases; and iteratively executing step S202 to step S203 to obtain the feature phrase set and the core phrase set.
In step S202, feature phrases are obtained, and in this step, the obtained feature phrases are used to match with other store names, and if some words in one store name match with the feature phrases, then an unmatched store name part can be extracted from the store name, and the unmatched store name part is recorded as a core phrase in the core phrase set.
For example, suppose the feature phrase "home quick delivery" is obtained in step S202 and a store name is "mcdonald home quick delivery". Matching the feature phrase against the store name matches "home quick delivery"; the unmatched part, "mcdonald", can be recorded in the core phrase set as a core phrase.
In this embodiment, the steps S202 to S203 are iteratively executed, for example, 10 rounds may be executed, and such processing is performed on all the obtained store names, so as to finally obtain the feature phrase set and the core phrase set.
The core phrases selected in step S201 are used to match the store names, so that feature phrases can be obtained, the obtained feature phrases are used to match the store names, so that core phrases are obtained, and the store names can be accurately mined based on the association relationship between the feature phrases and the core phrases.
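The iteration over steps S201 to S203 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes store names are pre-tokenized tuples, that core phrases are matched as prefixes and feature phrases as suffixes (consistent with the stated word order, but not spelled out in the text), and that iteration may stop early once neither set grows. The function name `mine_phrase_sets` and the sample names are hypothetical.

```python
def mine_phrase_sets(store_names, max_core_len=3, rounds=10):
    """Bootstrap core and feature phrase sets from tokenized store names."""
    # S201: short names are taken as the initial core phrases.
    core = {name for name in store_names if len(name) <= max_core_len}
    feature = set()
    for _ in range(rounds):
        sizes_before = (len(core), len(feature))
        # S202: a name that starts with a core phrase yields a feature phrase.
        for name in store_names:
            for c in list(core):
                if len(name) > len(c) and name[:len(c)] == c:
                    feature.add(name[len(c):])
        # S203: a name that ends with a feature phrase yields a core phrase.
        for name in store_names:
            for f in list(feature):
                if len(name) > len(f) and name[-len(f):] == f:
                    core.add(name[:-len(f)])
        if (len(core), len(feature)) == sizes_before:
            break  # assumption: stop early when nothing new is found
    return core, feature

# Hypothetical store names; "kendyr hot pot" is invented for illustration.
names = [
    ("mcdonald",),
    ("mcdonald", "home", "quick", "delivery"),
    ("kendyr", "home", "quick", "delivery"),
    ("kendyr", "hot", "pot"),
]
core, feature = mine_phrase_sets(names, max_core_len=1)
```

Here the seed "mcdonald" yields the feature phrase "home quick delivery", which in turn yields the core phrase "kendyr", which finally yields the feature phrase "hot pot" in the next round.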
Step S204, performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set.
In order to accurately mine the store names, after the feature phrase set and the core phrase set are obtained, it is further necessary to perform word segmentation on any phrase in the feature phrase set and the core phrase set, and calculate the probability that any word segmentation result is used as a core word and the probability that any word segmentation result is used as a feature word, specifically, the calculation may be performed by the method in step S204 to step S205:
after the core phrase set and the feature phrase set are obtained, one phrase is arbitrarily selected from the core phrase set, and word segmentation processing is performed on the phrase, for example, a word segmentation method based on character string matching, or a word segmentation method based on understanding, or a word segmentation method based on statistics, etc. may be adopted to perform word segmentation processing on the phrase, and a person skilled in the art may flexibly select a word segmentation method according to needs, or perform word segmentation processing by comprehensively using a plurality of methods, of course, the method is not limited to the enumerated word segmentation methods, and other word segmentation methods may also be used. After word segmentation processing, at least one word segmentation result is obtained, and for any word segmentation result, the times of the word segmentation result appearing in a core phrase set need to be counted and recorded as a first word frequency; and the frequency of the word segmentation result appearing in the feature phrase set is recorded as a second word frequency.
Similarly, a phrase is arbitrarily selected from the feature phrase set, and the phrase is subjected to word segmentation processing, for example, a word segmentation method based on character string matching, or a word segmentation method based on understanding, or a word segmentation method based on statistics, etc. may be adopted to perform word segmentation processing on the phrase, and a person skilled in the art can flexibly select a word segmentation method according to needs, or perform word segmentation processing by using multiple methods in a comprehensive manner, of course, the method is not limited to the listed word segmentation methods, and other word segmentation methods may also be used. After word segmentation processing, at least one word segmentation result is obtained, and for any word segmentation result, the times of the word segmentation result appearing in a core phrase set need to be counted and recorded as a first word frequency; and the frequency of the word segmentation result appearing in the feature phrase set is recorded as a second word frequency.
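The counting in step S204 can be sketched with two counters, assuming the phrases in both sets have already been segmented into tokens. The function name `word_frequencies` and the sample phrases are hypothetical.

```python
from collections import Counter

def word_frequencies(core_phrases, feature_phrases):
    """Return (first, second): token counts in the core set and in the feature set."""
    first = Counter(tok for phrase in core_phrases for tok in phrase)
    second = Counter(tok for phrase in feature_phrases for tok in phrase)
    return first, second

# Hypothetical segmented phrases (illustrative only).
core_phrases = [["mcdonald"], ["kendyr"], ["home", "kitchen"]]
feature_phrases = [["home", "quick", "delivery"], ["hot", "pot"]]
first, second = word_frequencies(core_phrases, feature_phrases)
```

A `Counter` returns 0 for tokens it has never seen, so a word that appears in only one of the two sets still has a well-defined frequency in the other.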
Step S205, calculating the probability of the word segmentation result as the core word and the probability of the feature word according to the first word frequency and the second word frequency.
For any word segmentation result in step S204, a first word frequency and a second word frequency are determined, and then the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word can be calculated according to the first word frequency and the second word frequency, for example, the probability of the word segmentation result as a core word is calculated by using the following formula (1), and the probability of the word segmentation result as a feature word is calculated by using the following formula (2).
Core word probability = (first word frequency + 1) / (first word frequency + second word frequency + 2)    (1)
Feature word probability = (second word frequency + 1) / (first word frequency + second word frequency + 2)    (2)
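Formulas (1) and (2) translate directly into code. The sketch below only restates them; the function names are hypothetical.

```python
def core_word_probability(first_freq, second_freq):
    """Formula (1): smoothed probability that a word acts as a core word."""
    return (first_freq + 1) / (first_freq + second_freq + 2)

def feature_word_probability(first_freq, second_freq):
    """Formula (2): smoothed probability that a word acts as a feature word."""
    return (second_freq + 1) / (first_freq + second_freq + 2)

# A word seen 3 times in the core phrase set and once in the feature phrase set.
p_core = core_word_probability(3, 1)
p_feature = feature_word_probability(3, 1)
```

The added 1 and 2 act as add-one smoothing: the two probabilities always sum to 1, and a word that appears in neither set gets 0.5 for both rather than causing a division by zero.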
Step S206, obtaining the target shop name to be processed, segmenting words of the target shop name, and determining at least one segmentation point, wherein words before the segmentation point are used as core words, and words after the segmentation point are used as feature words.
When store name mining is required, a target store name to be processed is obtained and then segmented; the word segmentation method may be the same as or different from the one used in step S204. Word segmentation divides the target store name into single characters or words composed of two or more characters. Each position at which the name is divided may be the boundary between the core word and the feature word of the target store name, so at least one segmentation point can be determined by segmenting the target store name; the words before a segmentation point are taken as the core word, and the words after it as the feature word.
Take the target store name "Kang Liande hot spring house" as an example. After word segmentation, the results are "kang", "Lian", "De" and "hot spring house", so four segmentation points can be determined. One segmentation point is between "kang" and "Lian", in which case "kang" is the core word and "Lian De hot spring house" is the feature word; one is between "Lian" and "De", in which case "Kang Lian" is the core word and "De hot spring house" is the feature word; one is between "De" and "hot spring house", in which case "Kang Liande" is the core word and "hot spring house" is the feature word; and one is after "hot spring house", in which case "Kang Liande hot spring house" is the core word and there is no feature word.
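The enumeration of segmentation points in this example can be sketched as below, with one candidate split after each token, including the final split that leaves the feature part empty. The function name `candidate_splits` is hypothetical.

```python
def candidate_splits(tokens):
    """Return (core_tokens, feature_tokens) for a split after each token."""
    return [(tokens[:i], tokens[i:]) for i in range(1, len(tokens) + 1)]

# Word segmentation results from the example in the text.
tokens = ["kang", "Lian", "De", "hot spring house"]
splits = candidate_splits(tokens)
```

For the four tokens above this gives four candidates, the last of which treats the whole name as the core word.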
Step S207, for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability.
After the segmentation points are determined, a segmentation probability is calculated for each of them. Specifically, step S205 has already calculated, for each word segmentation result, its probability of being a core word and its probability of being a feature word, and step S206 takes the words before a segmentation point as the core word and the words after it as the feature word. The probability of the core word and the probability of the feature word for a given segmentation point can therefore be determined by looking up the probabilities of the corresponding word segmentation results, as for the segmentation points in the example of step S206. For the other segmentation points, the probability of the core word and the probability of the feature word are determined in the same way, which is not repeated here.
After the probabilities of the core words and the probabilities of the feature words are determined, the segmentation probability corresponding to the segmentation point can be calculated according to the probabilities of the core words and the probabilities of the feature words, for example, the segmentation probability corresponding to the segmentation point = the probability of the core words.
Step S208, determining the segmentation point corresponding to the maximum segmentation probability as the target segmentation point of the target shop name.
After the segmentation probability of each segmentation point has been calculated, the segmentation probabilities can be sorted and the maximum selected. The segmentation point with the maximum segmentation probability is determined as the target segmentation point of the target shop name. The target segmentation point is the optimal segmentation point: it gives a reasonable mining result for the shop name, and the core word and the feature word of the shop name can be determined from it.
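Steps S207 and S208 can be sketched as below. The way the probabilities are combined is an assumption made only for illustration: the core-word probability is taken as the product of the per-token core probabilities before the segmentation point, the feature-word probability as the product of the per-token feature probabilities after it, and the segmentation probability as the product of the two. The names `best_split`, `core_prob`, `feature_prob` and `unseen`, and all numeric values, are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def best_split(
    tokens: List[str],
    core_prob: Dict[str, float],
    feature_prob: Dict[str, float],
    unseen: float = 1e-6,
) -> Tuple[str, str, float]:
    """Score every segmentation point and return the best (core, feature, score)."""
    candidates = []
    for i in range(1, len(tokens) + 1):
        # Assumed combination rule: multiply the looked-up token probabilities.
        p_core = math.prod(core_prob.get(t, unseen) for t in tokens[:i])
        p_feature = math.prod(feature_prob.get(t, unseen) for t in tokens[i:])
        candidates.append(("".join(tokens[:i]), "".join(tokens[i:]), p_core * p_feature))
    # S208: the segmentation point with the maximum probability is the target.
    return max(candidates, key=lambda c: c[2])


# Hypothetical probabilities, standing in for the lookup table from step S205.
core_prob = {"Kang": 0.9, "Lian": 0.8, "De": 0.7, "HotSpringHouse": 0.1}
feature_prob = {"Kang": 0.1, "Lian": 0.2, "De": 0.3, "HotSpringHouse": 0.9}
core, feature, score = best_split(
    ["Kang", "Lian", "De", "HotSpringHouse"], core_prob, feature_prob
)
print(core, feature)
```

Taking the maximum directly is equivalent to sorting the probabilities and picking the first entry, and avoids the sort.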
In an alternative embodiment of the present invention, after generating the feature phrase set, the method further comprises: counting the occurrence word frequency of each feature phrase in the feature phrase set, and taking the feature phrases whose occurrence word frequency is greater than or equal to a preset word frequency as category words or dish words of shops.
Specifically, after the feature phrase set is generated, for each feature phrase in the set, the number of times the feature phrase appears in the set is counted and taken as its occurrence word frequency. The occurrence word frequency is compared with the preset word frequency; if it is greater than or equal to the preset word frequency, the feature phrase is stored as a category word or dish word of shops. The stored category words and dish words can be used to classify shops, or can serve as a reference for others when naming a shop.
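The frequency filter described above can be sketched as follows, assuming the feature phrase collection retains duplicates so that occurrences can be counted. The function name `frequent_feature_phrases` and the parameter `min_freq` (the preset word frequency) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def frequent_feature_phrases(feature_phrases: Iterable[str], min_freq: int) -> Set[str]:
    """Keep the feature phrases that occur at least min_freq times."""
    counts = Counter(feature_phrases)
    return {phrase for phrase, count in counts.items() if count >= min_freq}


# Hypothetical feature phrases collected from many shop names.
phrases = ["hotpot", "hotpot", "hot spring house", "hotpot", "barbecue", "barbecue"]
print(frequent_feature_phrases(phrases, min_freq=2))
```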
In an alternative embodiment of the present invention, the core words and feature words extracted from shop names can be used for shop search. When a user searches for a shop on a search page, the user generally provides words describing the shop, referred to as shop search words. The method further includes: obtaining a shop search word, matching the shop search word against the core words of shop names, screening out from a plurality of shop names those whose core word matches the shop search word, and pushing the screened shop names to a client.
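A minimal sketch of this search step is given below. The source does not specify how matching is performed; exact equality between the search word and the core word is assumed here, and the function name `match_shops` and the sample data are hypothetical. Pushing the result to a client is left out.

```python
from typing import Dict, List


def match_shops(search_word: str, core_word_by_shop: Dict[str, str]) -> List[str]:
    """Return the shop names whose mined core word equals the search word."""
    return [
        shop_name
        for shop_name, core_word in core_word_by_shop.items()
        if core_word == search_word  # assumed matching rule: exact equality
    ]


# Hypothetical mapping from shop name to its mined core word.
core_word_by_shop = {
    "Kang Liande hot spring house": "Kang Liande",
    "Kang Liande hotpot": "Kang Liande",
    "Lao Wang barbecue": "Lao Wang",
}
print(match_shops("Kang Liande", core_word_by_shop))
```

A substring or fuzzy comparison could replace the equality test without changing the surrounding structure.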
The method provided by this embodiment of the invention automatically mines the core words and feature words of shop names. In addition, the association between the core words and the feature words in shop names is fully considered and used during mining, which improves mining accuracy. This overcomes the problem in the prior art that mining with the N-gram method pass by pass causes repeated operations and is time-consuming. The processing flow is simplified: the repeated N-gram mining problem (where N may take several values) is converted into a segmentation problem, namely the selection of an optimal segmentation point, which saves mining time and labor cost and removes the need for manual rechecking.
Fig. 3 is a schematic structural diagram of a shop name mining apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes: a generating module 301, a calculating module 302, a word segmentation module 303 and a determining module 304.
The generation module 301 is adapted to collect store names and generate a core phrase set and a feature phrase set according to the collected store names;
a calculating module 302, adapted to perform word segmentation on any phrase in the feature phrase set and the core phrase set, and calculate a probability that any word segmentation result is used as a core word and a probability that any word segmentation result is used as a feature word;
the word segmentation module 303 is adapted to obtain a target store name to be processed, segment the target store name, and determine at least one segmentation point, where a word before the segmentation point is used as a core word, and a word after the segmentation point is used as a feature word;
the determining module 304 is adapted to calculate a segmentation probability corresponding to each segmentation point according to the core word probability and the feature word probability, and determine a target segmentation point of the target store name according to the segmentation probability.
Optionally, the generating module is further adapted to: S1, obtaining a plurality of shop names, and screening out the shop names whose length is less than or equal to a preset word length as core phrases;
S2, matching the shop names by using the core phrases, and recording the unmatched parts of the shop names into a feature phrase set as feature phrases;
S3, matching the shop names by using the feature phrases, and recording the unmatched parts of the shop names into a core phrase set as core phrases; and iteratively executing S2-S3 to obtain the feature phrase set and the core phrase set.
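The S1-S3 iteration can be sketched as below. Several details are assumptions for illustration only: a core phrase is matched as a prefix of a shop name, a feature phrase is matched as a suffix, plain sets are used (so repeat counts are not kept), and iteration stops when neither set grows or after `max_rounds` rounds. The names `build_phrase_sets`, `max_core_len` and `max_rounds` are hypothetical.

```python
from typing import Iterable, Set, Tuple


def build_phrase_sets(
    names: Iterable[str], max_core_len: int, max_rounds: int = 10
) -> Tuple[Set[str], Set[str]]:
    """Grow core and feature phrase sets by alternating S2 and S3."""
    names = list(names)
    # S1: short shop names seed the core phrase set.
    core: Set[str] = {n for n in names if len(n) <= max_core_len}
    feature: Set[str] = set()
    for _ in range(max_rounds):
        before = (len(core), len(feature))
        # S2: strip a known core prefix; the remainder is a feature phrase.
        for name in names:
            for c in core:
                if name.startswith(c) and len(name) > len(c):
                    feature.add(name[len(c):])
        # S3: strip a known feature suffix; the remainder is a core phrase.
        for name in names:
            for f in feature:
                if name.endswith(f) and len(name) > len(f):
                    core.add(name[: -len(f)])
        if (len(core), len(feature)) == before:
            break  # neither set grew, so further rounds add nothing
    return core, feature


core, feature = build_phrase_sets(["AB", "ABxy", "CDxy"], max_core_len=2)
print(sorted(core), sorted(feature))
```

In this toy run the seed "AB" yields the feature phrase "xy", which in turn exposes "CD" as a new core phrase, showing how each set bootstraps the other.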
Optionally, the determining module is further adapted to: for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability;
and determining the segmentation point corresponding to the maximum segmentation probability as the target segmentation point of the target shop name.
Optionally, the calculation module is further adapted to: performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set;
and calculating the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word according to the first word frequency and the second word frequency.
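The calculation performed by the calculation module can be sketched as below. The formula is an assumption, since this passage only states that both probabilities are derived from the first and second word frequencies: here the core probability of a token is its first word frequency divided by the sum of both frequencies, and the feature probability is the complement. Phrases are assumed to be already segmented into token lists, and the name `token_probabilities` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_probabilities(
    core_phrases: List[List[str]], feature_phrases: List[List[str]]
) -> Dict[str, Tuple[float, float]]:
    """Map each token to (probability as core word, probability as feature word)."""
    # First word frequency: occurrences in the core phrase set.
    first = Counter(tok for phrase in core_phrases for tok in phrase)
    # Second word frequency: occurrences in the feature phrase set.
    second = Counter(tok for phrase in feature_phrases for tok in phrase)
    probs = {}
    for tok in first.keys() | second.keys():
        total = first[tok] + second[tok]
        probs[tok] = (first[tok] / total, second[tok] / total)  # assumed formula
    return probs


probs = token_probabilities(
    core_phrases=[["Kang", "Lian"], ["Kang"]],
    feature_phrases=[["hotpot"], ["Kang"]],
)
print(probs["Kang"], probs["hotpot"])
```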
Optionally, the apparatus further comprises: a processing module adapted to count the occurrence word frequency of each feature phrase in the feature phrase set, and to take the feature phrases whose occurrence word frequency is greater than or equal to the preset word frequency as category words or dish words of shops.
Optionally, the apparatus further comprises: a pushing module adapted to obtain a shop search word, match the shop search word against the core words of shop names, and screen out and push the shop names whose core word matches the shop search word.
With the apparatus provided by this embodiment of the invention, shop names are mined using the generated core phrase set and feature phrase set, which improves mining accuracy, and the core words and feature words of shop names are mined automatically. This overcomes the problem in the prior art that mining with the N-gram method pass by pass causes repeated operations and is time-consuming. The processing flow is simplified: the repeated N-gram mining problem (where N may take several values) is converted into a segmentation problem, which saves mining time.
Embodiments of the present invention provide a non-volatile computer storage medium in which at least one executable instruction is stored, and the executable instruction can be used to execute the store name mining method in any of the above method embodiments.
Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 4, the computing device may include: a processor 402, a communication interface 404, a memory 406, and a communication bus 408.
Wherein: the processor 402, the communication interface 404, and the memory 406 communicate with each other via the communication bus 408. The communication interface 404 is used to communicate with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps of the above store name mining method embodiments for a computing device.
In particular, program 410 may include program code comprising computer operating instructions.
The processor 402 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used to store the program 410. The memory 406 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 410 may specifically be configured to cause the processor 402 to execute the store name mining method in any of the above-described method embodiments. For the specific implementation of each step in the program 410, reference may be made to the corresponding steps and the corresponding descriptions of the units in the above store name mining embodiments, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (12)

1. A store name mining method, comprising:
collecting shop names, and generating a core phrase set and a feature phrase set according to the collected shop names;
performing word segmentation on any phrase in the feature phrase set and the core phrase set, and calculating the probability of any word segmentation result as a core word and the probability of any word segmentation result as a feature word;
obtaining a target shop name to be processed, segmenting the target shop name, and determining at least one segmentation point, wherein words before the segmentation point are used as core words and words after the segmentation point are used as feature words; calculating the segmentation probability corresponding to each segmentation point according to the core word probability and the feature word probability, and determining the target segmentation point of the target shop name according to the segmentation probability;
wherein, the collecting the shop names, and generating the core phrase set and the feature phrase set according to the collected shop names further comprises:
S1, obtaining a plurality of shop names, and screening out the shop names whose length is less than or equal to a preset word length as core phrases;
S2, matching the shop names by using the core phrases, and recording the unmatched parts of the shop names into a feature phrase set as feature phrases;
S3, matching the shop names by using the feature phrases, and recording the unmatched parts of the shop names into a core phrase set as core phrases; and iteratively executing S2-S3 to obtain the feature phrase set and the core phrase set.
2. The method of claim 1, wherein the calculating a segmentation probability corresponding to each segmentation point according to the core word probability and the feature word probability, and the determining a target segmentation point of a target store name according to the segmentation probability further comprises:
for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability;
and determining the segmentation point corresponding to the maximum segmentation probability as a target segmentation point of the target shop name.
3. The method according to claim 1 or 2, wherein the segmenting any phrase in the feature phrase set and the core phrase set, and the calculating the probability of any segmented result as a core word and the probability as a feature word further comprises:
performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set;
and calculating the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word according to the first word frequency and the second word frequency.
4. The method of claim 1 or 2, wherein after generating the feature phrase set, the method further comprises: counting the occurrence word frequency of each feature phrase in the feature phrase set, and taking the feature phrases whose occurrence word frequency is greater than or equal to a preset word frequency as category words or dish words of shops.
5. The method according to claim 1 or 2, wherein the method further comprises: obtaining a shop search word, matching the shop search word against the core words of shop names, and screening out and pushing the shop names whose core word matches the shop search word.
6. A shop name mining apparatus comprising:
the generating module is suitable for collecting shop names and generating a core phrase set and a feature phrase set according to the collected shop names;
the calculation module is suitable for segmenting any phrase in the feature phrase set and the core phrase set, and calculating the probability of any segmentation result as a core word and the probability of any segmentation result as a feature word;
the word segmentation module is suitable for acquiring a target store name to be processed, segmenting the target store name and determining at least one segmentation point, wherein words before the segmentation point are used as core words, and words after the segmentation point are used as feature words;
the determining module is suitable for calculating the segmentation probability corresponding to each segmentation point according to the core word probability and the feature word probability and determining the target segmentation point of the target shop name according to the segmentation probability;
wherein the generation module is further adapted to:
S1, obtaining a plurality of shop names, and screening out the shop names whose length is less than or equal to a preset word length as core phrases;
S2, matching the shop names by using the core phrases, and recording the unmatched parts of the shop names into a feature phrase set as feature phrases;
S3, matching the shop names by using the feature phrases, and recording the unmatched parts of the shop names into a core phrase set as core phrases; and iteratively executing S2-S3 to obtain the feature phrase set and the core phrase set.
7. The apparatus of claim 6, wherein the determining module is further adapted to: for any segmentation point, querying and determining the probability of the core word and the probability of the feature word, and calculating the segmentation probability corresponding to the segmentation point according to the core word probability and the feature word probability;
and determining the segmentation point corresponding to the maximum segmentation probability as a target segmentation point of the target shop name.
8. The apparatus of claim 6 or 7, wherein the computing module is further adapted to: performing word segmentation on any phrase in the feature phrase set and the core phrase set, and counting a first word frequency of any word segmentation result in the core phrase set and a second word frequency of any word segmentation result in the feature phrase set;
and calculating the probability of the word segmentation result as a core word and the probability of the word segmentation result as a feature word according to the first word frequency and the second word frequency.
9. The apparatus of claim 6 or 7, wherein the apparatus further comprises:
a processing module adapted to count the occurrence word frequency of each feature phrase in the feature phrase set, and to take the feature phrases whose occurrence word frequency is greater than or equal to the preset word frequency as category words or dish words of shops.
10. The apparatus of claim 6 or 7, wherein the apparatus further comprises:
a pushing module adapted to obtain a shop search word, match the shop search word against the core words of shop names, and screen out and push the shop names whose core word matches the shop search word.
11. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the shop name mining method according to any one of claims 1-5.
12. A computer storage medium having stored therein at least one executable instruction that causes a processor to perform operations corresponding to the store name mining method of any one of claims 1-5.
CN202010078834.XA 2020-02-03 2020-02-03 Shop name mining method and device Active CN111325025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010078834.XA CN111325025B (en) 2020-02-03 2020-02-03 Shop name mining method and device

Publications (2)

Publication Number Publication Date
CN111325025A CN111325025A (en) 2020-06-23
CN111325025B true CN111325025B (en) 2023-04-07

Family

ID=71173244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010078834.XA Active CN111325025B (en) 2020-02-03 2020-02-03 Shop name mining method and device

Country Status (1)

Country Link
CN (1) CN111325025B (en)

Also Published As

Publication number Publication date
CN111325025A (en) 2020-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant