WO2020001106A1 - Classification model training method and store classification method and device - Google Patents

Publication number
WO2020001106A1
WO2020001106A1 PCT/CN2019/080022 CN2019080022W WO2020001106A1 WO 2020001106 A1 WO2020001106 A1 WO 2020001106A1 CN 2019080022 W CN2019080022 W CN 2019080022W WO 2020001106 A1 WO2020001106 A1 WO 2020001106A1
Authority
WO
WIPO (PCT)
Prior art keywords
store
feature
information
semantic
review
Prior art date
Application number
PCT/CN2019/080022
Other languages
French (fr)
Chinese (zh)
Inventor
谢仁强
马书超
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Definitions

  • One or more embodiments of the present specification relate to the field of computer technology, and in particular, to a training method of a computer classification model, a method and a device of store classification.
  • One or more embodiments of the present specification describe a method and device that make full use of Internet data: by extracting effective training features, a classification model with higher accuracy is trained, and stores that have closed down are accurately identified when stores are classified, thereby improving the effectiveness of store classification.
  • a training method for a classification model is provided.
  • the classification model is used to determine whether a store is currently a real store, and the method includes: selecting a predetermined number of store samples, the store samples corresponding to store information and classification labels, wherein the classification labels include a real store label and a non-real store label, and the store information includes review information; extracting features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; and training the classification model based on the features and the classification labels of each store sample.
  • selecting a predetermined number of store samples includes: selecting, as positive samples, stores that have exhibited at least one of the following behaviors within a predetermined period: selling vouchers, group purchases, promotions, reservation services, Q&A interactions, advertising, and receiving customer check-ins through the client, wherein the positive samples correspond to the real store label.
  • selecting a predetermined number of store samples includes: selecting, as negative samples, stores that are marked as permanently closed on an electronic map, wherein the negative samples correspond to the non-real store label.
  • the first feature includes one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increase in the number of reviews within a predetermined time period.
  • the second feature is extracted by: obtaining first review information corresponding to a first store sample; using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, wherein the semantic labels include closed semantics and non-closed semantics; and determining the second feature of the first store sample according to the semantic labels.
  • determining the second feature of the first store sample according to the semantic labels includes: where the semantic labels include a label with closed semantics, determining that the second feature of the first store sample is that the review information contains semantics indicating that the store is not a real store.
  • the semantic model includes a supervised model trained on a labeled review dataset.
  • using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information includes: for first review data in the first review information, representing each word in the first review data as a word vector through an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.
  • the features further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • the store samples further include test samples, and the method further includes: checking the accuracy of the classification model's output result for each test sample, so as to obtain a detection result for the classification model from the accuracy of the output results; and adjusting the classification model according to the detection result until the detection result meets a preset condition.
  • a method for classifying a store is provided, which uses the classification model trained by any of the methods of the first aspect to determine whether a store is currently a real store. The method includes: obtaining store information of a store to be classified, wherein the store information includes review information; extracting features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; and determining, according to the output result, whether the store to be classified is currently a real store.
  • a training device for a classification model is provided.
  • the classification model is used to determine whether a store is currently a real store.
  • the device includes: a selection unit configured to select a predetermined number of store samples, wherein the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information;
  • an extraction unit configured to extract features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; and
  • a training unit configured to train the classification model based on the features and the classification labels of each store sample.
  • a device for classifying a store is provided.
  • the classification model trained by the training device of the third aspect is used to determine whether a store is currently a real store.
  • the device includes: an obtaining unit configured to obtain store information of a store to be classified, wherein the store information includes review information;
  • an extraction unit configured to extract features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store;
  • a classification unit configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and
  • a determining unit configured to determine, according to the output result, whether the store to be classified is currently a real store.
  • a computer-readable storage medium having stored thereon a computer program, which when executed in a computer, causes the computer to execute the method of the first aspect or the second aspect.
  • a computing device is provided, including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.
  • the store information corresponding to the selected store samples includes review information, and the features of the store samples extracted from the store information include a first feature obtained based on at least the time-related attributes of the review information and a second feature determined based on the semantic description, contained in the review information, related to the authenticity of the store.
  • In this way, Internet data can be fully utilized to extract effective training features and train a classification model with higher accuracy.
  • When stores are classified, the extracted features of the stores to be classified also include the above-mentioned first and second features. In this way, Internet data can be fully utilized to improve the accuracy of store classification and thereby its effectiveness.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
  • FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment.
  • FIG. 3 shows a specific example of the second feature extraction.
  • FIG. 4 shows a specific example of the model training process.
  • FIG. 5 shows a flowchart of a store classification method according to an embodiment.
  • FIG. 6 shows a schematic block diagram of a training device for a classification model according to an embodiment.
  • FIG. 7 shows a schematic block diagram of a store classification device according to an embodiment.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
  • users can view store information through client applications, such as map applications, shopping applications, ordering applications, and so on.
  • the client application here can run on various terminal devices with data processing capabilities, such as smart phones, tablet computers, desktop computers, smart watches, and so on.
  • the store information displayed on the client application is provided through the server.
  • the server may be a processing device with a certain data processing capability, or a processing device cluster.
  • the computing platform trains a classification model, and the server uses the classification model to classify the store, determine whether the store is a real store, and display it to the user through a client application.
  • Here, a real store is a store that actually exists and has not permanently closed or gone bankrupt. A short suspension of business (such as two days) does not make a store a non-real store.
  • the computing platform may be set in a server or a processing device independent of the server, which is not limited in this application.
  • the classification model trained by the computing platform can be reused by the server.
  • the results of the server's classification of the store through the classification model can also be reused.
  • the computing platform may first select a predetermined number of store samples, perform feature extraction on the store samples, and then train a classification model based on the extracted features and known classification results.
  • the store information corresponding to the selected store samples may include review information, so that when the features are extracted, the review information may be used to obtain the first feature, based on at least the time-related attributes of the review information, and to determine the second feature, based on the semantic description, contained in the review information, related to the authenticity of the store. In this way, it is possible to make full use of Internet data, extract effective training features, and train a classification model with higher accuracy.
  • the server uses the classification model trained by the computing platform to classify the stores to be classified.
  • the server may first obtain the store information of the store to be classified, where the store information includes review information, then extract the features of the store to be classified based on the store information, input them into the classification model trained by the computing platform to obtain the output result of the classification model, and determine, according to the output result, whether the store to be classified is currently a real store.
  • the features that the server extracts for the store to be classified also include the above-mentioned first feature and second feature extracted from the review information. In this way, it is possible to make full use of Internet data, extract effective features, improve the accuracy of store classification, and thereby make store classification results more effective.
  • the store information sent by the server to the client may include only store information of non-closed stores, or store information of all stores.
  • the store information sent by the server to the client may also include information on whether the store is closed.
  • FIG. 1 only shows one specific implementation scenario of an embodiment disclosed in this specification and does not limit the scope of the implementation scenarios of the embodiments of this specification. For example, other implementation scenarios may differ, including with respect to the client in FIG. 1, and so on.
  • FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment.
  • the execution subject of the method may be a system, equipment, device, platform or server with certain computing and data processing capabilities, such as the computing platform shown in FIG. 1.
  • the classification model involved in this method can be used to determine whether the store is currently a real store.
  • the method includes the following steps: Step 21: Select a predetermined number of store samples.
  • the store samples correspond to store information and classification labels.
  • the classification labels include real store labels and non-real store labels.
  • the store information includes review information.
  • Step 22: Extract the features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store.
  • Step 23: Train the classification model based on the features and classification labels of each store sample.
  • a predetermined number of store samples are selected, and the store samples correspond to store information and classification labels.
  • the classification labels include a real store label and a non-real store label. Understandably, user reviews usually reflect users' direct, real experience of a store, and the reviews of real stores differ markedly from those of non-real stores. For example, a non-real store may have no reviews or few reviews. The review information may therefore have a large influence on how a store is classified, so the store information corresponding to a store sample may include at least review information.
  • the comment information may include comment content, comment time, number of comments, and so on.
  • the store information can be crawled from a predetermined website (for example, XX reviews) by a web crawler (for example, one written in Python). For example, user registration information or content publishing information on the predetermined website can be crawled. The store information can then be obtained from the type of registered user (such as a store or a consumer) in the user registration information, the type of content (such as a sale or a purchase) in the content publishing information, and the like. If the published content is sale information, the user who posted it may be the store, from which the store name, store location, and review information can be obtained. In practice, the electronic map can also be searched using information such as the store name and store location to determine the classification label of the store. For example, stores that cannot be found on the electronic map are non-real stores.
  • store samples can also be collected manually offline, for example, by manually checking store addresses on a website or map one by one to determine their classification labels. The store information of the corresponding store can also be obtained through at least one of a phone call, a search engine, registration information of the administrative department, and the like.
  • the review information in the store information can be obtained by, for example, a phone call, a "question and answer" in a search engine, and the like.
  • store samples of known classification tags may also be obtained through acquisition channels that include more aspects, which are not described in detail here.
  • Store samples can include positive and negative samples. Among them, a positive sample may correspond to a real store label, and a negative sample may correspond to a non-real store label.
  • a store that has exhibited at least one of the following behaviors within a predetermined period can be selected as a positive sample: selling vouchers, group purchase activities, promotional activities (such as discounts), reservation services, Q&A interactions, advertising, receiving customer check-ins through the client, and so on.
  • some sales methods may be used in store operations, such as selling vouchers, organizing group purchases, and organizing promotional activities.
  • Some stores (such as hotels and restaurants) provide reservation services, some stores conduct Q&A interactions with consumers or potential consumers on related websites (such as travel guide websites), and some stores cooperate with websites to place advertisements to increase page views or search rankings.
  • some stores can receive customers' check-ins at the store through an application (such as a food review website). When a customer taps check-in on the store's page in the client and the deviation between the check-in location and the store location is within a set distance (such as 80 meters), the check-in succeeds.
  • a store that offers check-in is likely to be a real store, and customers check in when they visit the store. Therefore, a store that exhibits one of the above behaviors currently or within the predetermined period can be determined to be a positive sample, and these positive samples can be assigned the real store label.
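  • The check-in distance test above can be sketched as follows. This is a minimal illustration only: the source does not say how the deviation is measured, so the great-circle (haversine) distance, the function names, and the coordinates used in testing are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def check_in_succeeds(store_pos, check_in_pos, max_distance_m=80.0):
    """Hypothetical helper: True when the check-in lies within the set distance."""
    return haversine_m(*store_pos, *check_in_pos) <= max_distance_m
```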
  • a store that meets the following conditions may be selected as a negative sample: it is marked as permanently closed on the electronic map.
  • a closed store will be deleted from the map or marked as permanently closed. Therefore, the store name and store location can be used to search an electronic map application for stores marked as permanently closed, the electronic map can be used to confirm that the store location is correct, and these stores can be used as negative samples and assigned the non-real store label.
  • the store information corresponding to the store sample can also be obtained.
  • the store information may include, for example, a store name, a store address, and the like.
  • the store information may further include, but is not limited to, at least one of the following: basic store information, such as phone number, business hours, and whether a wireless network connection (such as wifi) is provided; store brand name; store labels given by the website or an administrative supervision department, such as an overseas food selection or a local tourism bureau recommendation; and store category, such as food, shopping, or hotels.
  • non-real stores are stores that have been permanently closed, and their number is often smaller than that of real stores.
  • down-sampling can be applied to the store samples with real store labels so that the number of store samples with real store labels and the number with non-real store labels are approximately equal, for example, 45,000 each.
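  • A minimal sketch of the down-sampling step, assuming each sample is a hypothetical (store_id, label) pair with label 1 for a real store and 0 for a non-real store, and assuming uniform random sampling (the source does not specify the sampling strategy):

```python
import random

def downsample_positives(samples, seed=0):
    """Randomly drop real-store samples until both classes have the same size.

    samples: list of (store_id, label) pairs; label 1 = real, 0 = non-real.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    if len(positives) > len(negatives):
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        positives = rng.sample(positives, len(negatives))
    return positives + negatives
```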
  • the features of the store sample are extracted based on the store information.
  • the above features include at least a first feature and a second feature.
  • the first feature is obtained based on at least the time-related attributes in the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store. It is worth noting that "first" and "second" in "first feature" and "second feature" are used only to distinguish two different features and do not imply any order.
  • the time-related attributes of the review information may include, but are not limited to, at least one of the following: the time when a review was posted (such as May 1, 2018), the time elapsed between the review and the current time (such as 10 hours or 20 days), and the number of reviews (such as 100) in a predetermined time period (such as 2 days). Understandably, a real store keeps receiving new customers who consume and post reviews, so its latest review tends to be recent, the time elapsed since that review tends to be short, and the number of reviews is more likely to grow within the predetermined time period. A non-real store has no new customers, so its latest review is older, the time elapsed since that review is longer, and the number of reviews is less likely to grow within the predetermined time period.
  • the first feature may include, but is not limited to, one or more of the following: the time of the latest comment, the length of the latest comment from the current time, and the increment of the number of comments within a predetermined time period.
  • the latest review time may be the time of the review closest to the current time, for example, 20:00 on March 2, 2015.
  • the length of the latest comment from the current time can be the time difference between the current time and the latest comment time, such as 30 days.
  • the increment of the number of reviews in a predetermined time period is the change in the total number of reviews over each predetermined time period. For example, suppose the predetermined time period is 3 months. Based on the review times, the total number of reviews is counted for every 3 months back from the current time, and the increment of the number of reviews is calculated. If the total number of reviews in the last 3 months is 1000, the review increment for the most recent 3 months is 1000. In this way, the time-related attribute data of the review information of store samples on the Internet can be fully utilized.
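  • The three time-related quantities described above can be sketched as follows. The function name, the dictionary keys, and the choice of 90 days as a stand-in for "3 months" are illustrative assumptions, not part of the source.

```python
from datetime import datetime, timedelta

def time_features(review_times, now, period_days=90):
    """Compute first-feature candidates from a list of review timestamps."""
    if not review_times:
        return {"latest_review_time": None, "days_since_latest": None, "review_increment": 0}
    latest = max(review_times)
    window_start = now - timedelta(days=period_days)
    # Reviews posted inside the most recent period count as the increment.
    increment = sum(1 for t in review_times if window_start < t <= now)
    return {
        "latest_review_time": latest,
        "days_since_latest": (now - latest).days,
        "review_increment": increment,
    }
```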
  • the semantic description related to the authenticity of the store contained in the review information may be a semantic description indicating that the store has closed or that it is in good business condition. For example, "the store is closed and no longer exists" may be a semantic description indicating that the store has been permanently closed.
  • the same review may also mean different things depending on other information such as the time it was posted. For example, for a restaurant, the comment "came all this way, and it was already closed" posted at 12 midnight may mean only that the restaurant had closed for the day, whereas the same comment posted at 12 noon may mean that the restaurant has gone out of business.
  • a very small number of comments (such as 1) that contain the semantics of expressing a shop closure may indicate that the shop has been permanently closed. Therefore, the feature may include a second feature that can reflect whether the review information has a semantic description of the store being permanently closed.
  • the second feature may be expressed in words, for example, having a semantic description of the store permanently closed or including a semantic description related to the authenticity of the store, not having a semantic description of the store permanently closed or not including a semantic description related to the authenticity of the store, and so on.
  • the second feature may also be represented by a numerical value, for example, the second feature is 1 in the case of having a semantic description of the store permanently closed, the second feature is 0 in the case of having no semantic description of the store permanently closed, and so on.
  • the second feature can be extracted as follows: step 31, obtain first review information corresponding to a first store sample; step 32, use a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, wherein the semantic labels include closed semantics and non-closed semantics; step 33, determine the second feature of the first store sample according to the semantic labels.
  • the "first" in "first store sample" and "first review information" here means "some", "one of them", or "any one", and reflects the correspondence between a store sample and its review information; it does not indicate an order or distinguish between store samples.
  • the review information of the store sample may be obtained first.
  • the review information of a shop sample may correspond to one or more pieces of review data.
  • Each review data may include a review content, a review time, and data such as a user ID who posted the review.
  • a pre-trained semantic model is used to determine the semantic label corresponding to each piece of review data in the review information.
  • each piece of comment data can correspond to a semantic tag.
  • Each piece of comment data can be input into a pre-trained semantic model, and the semantic label of a piece of comment data can be determined according to the output of the semantic model.
  • the semantic model can be trained through a pre-annotated comment set.
  • some reviews can be selected from the review data of multiple store samples and added to the review set, in particular review data containing sentences such as "closed". The semantic labels of these review data are determined through manual identification and labeling, and are used as known semantic labels to train a supervised model, such as a logistic regression (LR) model.
  • Model training is a process of determining model parameters with known inputs (such as comment sentences) and outputs (such as known semantic labels), and will not be repeated here.
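  • As a minimal sketch of determining model parameters from known inputs and outputs, the following trains a logistic regression model with plain stochastic gradient descent. It is an illustration under assumptions, not the procedure of the source: the inputs are hypothetical numeric review vectors, and the learning rate and epoch count are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that a review vector x carries closing semantics."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic_regression(vectors, labels, lr=0.5, epochs=500):
    """Fit weights and bias by SGD on the log loss; labels are 1 or 0."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            err = predict_proba(weights, bias, x) - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias
```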
  • the semantic label of a piece of review data may be "with closing semantics" or "without closing semantics".
  • the output of the semantic model can be one of the semantic labels directly, or it can be a numerical value, such as 1, 0, and so on.
  • the output of the semantic model is one of two possible values (such as 1, 0, etc.), where each value corresponds to a semantic label, such as 1 corresponding to a closed business semantic label.
  • the output of the semantic model can also be one of multiple possible values (such as any decimal between 0-1, etc.).
  • a threshold can be set to determine which semantic label the output value leans toward; for example, an output greater than 0.6 leans toward the "with closing semantics" label.
  • each word in the review data may first be represented as a word vector through an unsupervised word vector model (such as the word2vec model); a review vector corresponding to the review data is determined based on the word vectors; the review vector is input into the semantic model to obtain an output result of the semantic model; and a semantic label is added to the review data according to the output result.
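  • A minimal sketch of mapping a continuous model output to a semantic label with the example threshold of 0.6. The label strings and the function name are hypothetical.

```python
def semantic_label(score, threshold=0.6):
    """Map a semantic-model output in [0, 1] to one of two semantic labels."""
    # Strictly greater than the threshold, matching "greater than 0.6" in the text.
    return "closed" if score > threshold else "not_closed"
```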
  • the review vector corresponding to the review data is determined based on each word vector. For example, the review vector may be an average of different dimensions of each word vector, or a weighted average of different dimensions.
  • For example, each word in a piece of comment data is first represented as a word vector.
  • The comment vector corresponding to the comment data may then be the average of these word vectors, taken dimension by dimension.
  • The number of occurrences of each word can also be used as a weight, so that the comment vector is a weighted average of the word vectors over each dimension.
  • In that weighted average, the coefficient in front of each word vector is the number of occurrences of the corresponding word, and the denominator is the sum of the numbers of occurrences of all words.
  • In the example, each word occurs once, so every coefficient is 1; in practice the counts can take other values.
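  • The worked formulas of the example are not present in this text. As a sketch with hypothetical symbols, let the comment contain n distinct words with word vectors v_1, ..., v_n and occurrence counts c_1, ..., c_n. The plain average and the count-weighted average described above can then be written as:

```latex
\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i,
\qquad
\bar{v}_{\mathrm{weighted}} = \frac{\sum_{i=1}^{n} c_i\, v_i}{\sum_{i=1}^{n} c_i}
```

  • When every count c_i equals 1, the two expressions coincide.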
  • the comment vector can be input into the semantic model to obtain the output of the semantic model. Understandably, each dimension of the comment vector can also be input into the semantic model as a separate feature. A semantic label can then be added to the comment data according to the output of the semantic model. For example, if the output of the semantic model is 1, the semantic label "with closing semantics" is added to the comment data.
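  • A minimal sketch of building a comment vector from word vectors, either as a plain average over distinct words or as an average weighted by occurrence counts. The word vectors would normally come from a trained model such as word2vec; here they are a hypothetical dictionary, and the function name is an assumption.

```python
from collections import Counter

def comment_vector(words, word_vectors, weighted=False):
    """Average the vectors of the known words in a comment, dimension by dimension."""
    counts = Counter(w for w in words if w in word_vectors)
    if not counts:
        return None  # no known word, so no vector can be built
    dim = len(next(iter(word_vectors.values())))
    total = sum(counts.values()) if weighted else len(counts)
    vec = [0.0] * dim
    for word, count in counts.items():
        weight = count if weighted else 1
        for k, value in enumerate(word_vectors[word]):
            vec[k] += weight * value
    return [v / total for v in vec]
```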
  • Step 33: determine the second feature of the corresponding store sample according to the semantic tags corresponding to the store sample.
  • The second feature may be expressed in words, for example "has a semantic description of permanent store closure" (or "includes a semantic description related to the authenticity of the store") versus "has no semantic description of permanent store closure", or numerically, for example as 1 or 0.
  • When the semantic tags corresponding to the store sample include a tag with closing semantics, the second feature of the store sample is determined to include the semantics that the store is not a real store.
  • A number threshold may also be set, so that the second feature of the store sample is determined to include the semantics that the store is not a real store only when the number of pieces of comment data tagged with closing semantics exceeds the number threshold (such as 10).
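  • A minimal sketch of this rule is shown below. The numeric encoding (1 for "with closing semantics", 0 otherwise) follows the example values mentioned in the text; the function name `second_feature` and the default threshold of 0 (meaning that a single closing tag is enough) are assumptions of the example.

```python
def second_feature(semantic_tags, count_threshold=0):
    """Return 1 if the tags indicate that the store is not a real store, else 0.

    semantic_tags   -- one tag per piece of comment data: 1 means
                       "with closing semantics", 0 means without
    count_threshold -- the feature is set only when the number of closing
                       tags exceeds this value (0 means one tag suffices)
    """
    closing_count = sum(1 for tag in semantic_tags if tag == 1)
    return 1 if closing_count > count_threshold else 0
```

  • With `count_threshold=10`, ten closing tags are not enough to set the feature, while eleven are.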
  • The features of the store sample may also include review count features, such as the total number of reviews, the number of positive reviews, the number of negative reviews, the proportion of negative reviews, and the number of pictures in the reviews. It can be understood that a store with a large proportion of negative reviews is more likely to be a non-real store, while a store with a large total number of reviews or a large number of pictures in its reviews is more likely to be a real store. Therefore, the review count features can be used as a factor that influences whether a store is classified as a real store.
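  • The review count features listed above could be computed along the following lines. The record layout (a `rating_kind` field and a `picture_count` field per review) is hypothetical and chosen only for this sketch.

```python
def review_count_features(reviews):
    """Compute simple count-based features from a list of review records.

    Each review is a dict with a hypothetical "rating_kind" key
    ("positive", "negative" or anything else) and an optional
    "picture_count" key.
    """
    total = len(reviews)
    positive = sum(1 for r in reviews if r.get("rating_kind") == "positive")
    negative = sum(1 for r in reviews if r.get("rating_kind") == "negative")
    pictures = sum(r.get("picture_count", 0) for r in reviews)
    return {
        "total_reviews": total,
        "positive_reviews": positive,
        "negative_reviews": negative,
        # Proportion of negative reviews; 0.0 when there are no reviews.
        "negative_ratio": negative / total if total else 0.0,
        "picture_count": pictures,
    }

sample_reviews = [
    {"rating_kind": "positive", "picture_count": 3},
    {"rating_kind": "positive"},
    {"rating_kind": "negative", "picture_count": 1},
    {"rating_kind": "neutral"},
]
counts = review_count_features(sample_reviews)
```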
  • the characteristics of the store sample may further include basic information completeness characteristics.
  • Basic information includes, for example, the telephone number, business hours, whether a wireless network connection (such as Wi-Fi) is available, service facilities, and so on. The more complete the basic information is, the more likely it is that the store really exists.
  • the basic information completeness may be proportional to the number of basic information items. Therefore, the basic information completeness feature can be used as a factor that influences whether the store is classified as a real store.
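  • One way to obtain a completeness value proportional to the number of basic information items is sketched below. The field names and the choice to express completeness as a fraction of a fixed field list are assumptions of the example.

```python
# Hypothetical list of basic information fields, taken from the examples in the text.
BASIC_FIELDS = ("telephone", "business_hours", "wifi", "service_facilities")

def basic_info_completeness(store_info, fields=BASIC_FIELDS):
    """Return the fraction of basic information fields that are filled in.

    A field counts as filled when it is present and is neither None nor an
    empty string; False (for example "no Wi-Fi") still counts as information.
    """
    filled = sum(1 for f in fields if store_info.get(f) not in (None, ""))
    return filled / len(fields)
```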
  • the characteristics of the store sample may further include predetermined identification characteristics.
  • The predetermined identifier may be, for example, a brand store identifier, a chain store identifier, or a preferred label given by a website or an administrative agency (such as a recommendation label from a local tourism bureau). Understandably, brand stores and chain stores are usually stores with high visibility and market recognition, and they are more likely to be real stores. Websites and administrative agencies give preferred labels to stores that have passed audits and inspections, and these stores are also more likely to be real stores. Therefore, the predetermined identification feature can be used as a factor that influences whether the store is classified as a real store.
  • the characteristics of the store sample may further include store operation category characteristics.
  • The store operation category may be, for example, food, hotel, clothing, and so on. On some websites, food stores receive many reviews, so classifying by the number of reviews alone gives low accuracy. Stores in different operation categories can therefore be treated differently, with greater weight given to stores in categories that receive fewer reviews.
  • the characteristics of the store sample may also include consumer scoring characteristics.
  • Consumer ratings can be either points or star ratings. It is worth noting that if the store samples are obtained from the same website and the consumer scores follow the same standard, the consumer scores can be used directly as the consumer scoring feature. If the store samples are not obtained from the same website, so that the scoring standards may differ, the ratio of the consumer score to the full score can be used as the consumer scoring feature. Consumer ratings affect the customer flow of a store, and a store with low customer flow is more likely to become a non-real store. Therefore, the consumer scoring feature can be used as a factor that influences whether the store is classified as currently being a real store.
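  • The ratio of the consumer score to the full score can be computed as below, so that scores from websites with different scales become comparable. The function name and the example scales (5 stars, 10 points) are assumptions for illustration.

```python
def consumer_score_feature(score, full_marks):
    """Normalise a consumer score to the range [0, 1] as score / full marks."""
    if full_marks <= 0:
        raise ValueError("full_marks must be positive")
    return score / full_marks

# A 4-star rating out of 5 and 8 points out of 10 map to the same feature value.
stars = consumer_score_feature(4, 5)
points = consumer_score_feature(8, 10)
```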
  • the features of the store sample may also include more features, which will not be exemplified here.
  • the classification model is trained based on the characteristics and classification labels of each store sample.
  • the process of model training is the process of determining model parameters based on known input features and classification results.
  • the input feature is the feature of the store sample, where the feature includes multiple input features
  • the classification result is determined according to the classification label of the store sample.
  • For example, the output result includes 0 and 1, where 0 corresponds to the real store label and 1 corresponds to the non-real store label, and so on.
  • a store sample corresponds to a set of known input features and classification results.
  • The known input features fed to the input layer 42 are the features of each store sample, and the output results of the output layer 43 can be compared with the classification labels of the corresponding store samples. According to the comparison result, the parameters of the intermediate layer 44 are adjusted, as are the weight parameters represented by the arrows between the input layer 42 and the intermediate layer 44 and between the intermediate layer 44 and the output layer 43.
  • The known input features fed to the input layer 42 include a first feature 421 and a second feature 422, both of which are obtained from data related to the review information 411 in the store information 41.
  • store samples can be divided into training samples and test samples.
  • The features of each training sample are used as input in turn, and the parameters of the classification model are adjusted according to the comparison between the output of the classification model and the classification label, so that the output of the classification model becomes more consistent with the classification label of the currently input training sample. The classification model is trained in this way.
  • The features of each test sample are input into the classification model trained with the training samples, and the classification labels corresponding to the test samples are used to check the accuracy of each output result of the classification model, so as to obtain the detection result of the classification model. For example, if the classification label and the output of the classification model are consistent, the output of the classification model is determined to be correct. In this way, the detection results of the classification model over all the test samples, such as accuracy, can be obtained.
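  • Checking the outputs against the classification labels of the test samples can be sketched as follows. The `toy_predict` function is a hypothetical stand-in for a trained classification model, and the feature name and test data are invented for the example.

```python
def evaluate(predict, test_samples):
    """Return the fraction of test samples whose output matches the label.

    predict      -- callable mapping a feature dict to a label (0 or 1)
    test_samples -- list of (features, classification_label) pairs
    """
    if not test_samples:
        raise ValueError("no test samples")
    correct = sum(1 for features, label in test_samples
                  if predict(features) == label)
    return correct / len(test_samples)

def toy_predict(features):
    # Hypothetical stand-in for a trained model: 1 = non-real store.
    return 1 if features["has_closing_semantics"] else 0

toy_test_samples = [
    ({"has_closing_semantics": 1}, 1),
    ({"has_closing_semantics": 0}, 0),
    ({"has_closing_semantics": 0}, 1),  # misclassified by the stand-in model
    ({"has_closing_semantics": 1}, 1),
]
fraction_correct = evaluate(toy_predict, toy_test_samples)
```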
  • The classification model may be further adjusted according to the detection result, for example by adjusting the grid structure of the classification model or by changing to a different classification model. For example, when the classification model is a gradient boosted decision tree (GBDT) model, the number of trees, the depth of each tree, and the learning rate can be adjusted. After the classification model is adjusted, it is trained again with the training samples, and the test samples are used to obtain its detection result, until the detection result meets the preset condition.
  • the preset condition here may be a condition set on a detection result of the classification model.
  • The detection result may include values such as the area under the curve (AUC), accuracy, recall rate, and F1 score.
  • For example, the preset condition may be that the accuracy and the recall rate are both greater than 0.7.
  • AUC: 0.868; accuracy: 0.767; recall rate: 0.803; F1 score: 0.784.
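  • As a consistency check on these figures, the F1 score is conventionally the harmonic mean of precision and recall. The sketch below assumes that the value translated here as "accuracy" plays the role of precision; under that assumption the computed value is about 0.785, which agrees with the reported 0.784 up to rounding of the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported in the text; treating "accuracy" as precision is an assumption.
precision, recall = 0.767, 0.803
f1 = f1_score(precision, recall)

# Example preset condition from the text: both values greater than 0.7.
meets_preset_condition = precision > 0.7 and recall > 0.7
```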
  • The store information corresponding to the selected store sample includes review information. Therefore, the features extracted from the store information may include at least: a first feature obtained based on the time-related attributes of the review information, and a second feature determined based on the semantic description, contained in the review information, that relates to the authenticity of the store. Training a classification model on features that include the first feature and the second feature makes full use of Internet data and yields a classification model with higher accuracy, thereby improving the effectiveness of store classification.
  • a method for classifying a store is also provided. It is used to determine whether the store is a real store through a classification model. This method is suitable for an electronic device with a certain data processing capability, such as the server in FIG. 1.
  • The embodiment of the method for classifying a store includes the following steps: step 51, obtaining store information of a store to be classified, where the store information includes review information; step 52, extracting features of the store to be classified based on the store information, where
  • the features include at least a first feature and a second feature,
  • the first feature is obtained based at least on the time-related attributes of the review information, and the second feature is determined based on the semantic description, included in the review information, that relates to the authenticity of the store;
  • step 53, inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; and
  • step 54, determining, according to the output result, whether the store to be classified is currently a real store.
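  • Steps 51 to 54 can be wired together as in the sketch below. Everything concrete here is hypothetical: `toy_extract` and `toy_model` are stand-ins for the real feature extraction and the trained classification model, and the 0.6 threshold is only the example value mentioned elsewhere in this description.

```python
def classify_store(store_info, extract_features, model, threshold=0.6):
    """Return True if the store is judged to be a real store.

    store_info       -- store information obtained in step 51
    extract_features -- callable implementing step 52
    model            -- callable implementing step 53; returns a value where
                        values closer to 1 mean "non-real store"
    """
    features = extract_features(store_info)  # step 52
    output = model(features)                 # step 53
    return output <= threshold               # step 54

def toy_extract(store_info):
    # Hypothetical features: recency of reviews and count of closing mentions.
    reviews = store_info["reviews"]
    return {
        "days_since_latest_review": min(r["days_ago"] for r in reviews),
        "closing_mentions": sum(r["closing"] for r in reviews),
    }

def toy_model(features):
    # Hand-written stand-in for a trained classification model.
    stale = features["days_since_latest_review"] > 365
    return 0.9 if stale or features["closing_mentions"] > 0 else 0.1

open_store = {"reviews": [{"days_ago": 3, "closing": 0}]}
closed_store = {"reviews": [{"days_ago": 20, "closing": 1}]}
```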
  • In step 51, store information of a store to be classified is obtained.
  • the store information includes at least review information, such as review content, review time, and number of reviews.
  • the store information may also include but is not limited to at least one of the following: basic store information, store brand name, store label given by the website or administrative supervision department, store classification, etc.
  • Store information can be crawled from a predetermined website (such as a review website) by a web crawler (for example, one written in Python).
  • In step 52, the features of the store to be classified are extracted based on the store information.
  • the features here correspond to the input features of the classification model.
  • the feature includes at least a first feature and a second feature.
  • the first feature is obtained based on at least the time-related attributes of the review information
  • the second feature is determined based on the semantic description related to the authenticity of the store included in the review information. It is worth noting that the "first" and “second” in the "first feature” and “second feature” are only used to distinguish between two different features, and do not indicate a sequence limitation.
  • The time-related attributes of the review information may include, but are not limited to, at least one of the following: the time at which a review was posted, the time elapsed between the review and the current time, the number of reviews in a predetermined time period, and the like.
  • The first feature may include, but is not limited to, one or more of the following: the time of the latest review, the time elapsed since the latest review, and the increase in the number of reviews within a predetermined time period. In this way, the time-related attribute data of the review information available on the Internet can be fully used.
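  • The three time-related quantities named above can be derived from review timestamps as in the sketch below. The function name, the 30-day window, and the output keys are assumptions of the example; the increase in the number of reviews is approximated here by the count of reviews posted inside the window.

```python
from datetime import datetime, timedelta

def first_feature(review_times, now, window_days=30):
    """Derive time-related features from a list of review timestamps."""
    if not review_times:
        return {"latest_review_time": None,
                "days_since_latest": None,
                "recent_review_count": 0}
    latest = max(review_times)
    window_start = now - timedelta(days=window_days)
    # Reviews posted within the predetermined time period ending at `now`.
    recent = sum(1 for t in review_times if window_start <= t <= now)
    return {"latest_review_time": latest,
            "days_since_latest": (now - latest).days,
            "recent_review_count": recent}

now = datetime(2019, 3, 1)
times = [datetime(2018, 12, 1), datetime(2019, 2, 20), datetime(2019, 2, 25)]
features = first_feature(times, now)
```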
  • The semantic description related to the authenticity of the store contained in the review information may be a semantic description indicating that the store is closed, or that it is in good business condition.
  • a very small number of comments such as one
  • the second feature can be expressed in words or numerically.
  • The second feature can be extracted by: obtaining the review information of the store to be classified; using a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the review information, where the semantic tag indicates either having closing semantics or not having closing semantics; and determining the second feature of the store to be classified according to the semantic tags corresponding to the store to be classified.
  • Each piece of review data may include the review content, the review time, and data such as the ID of the user who posted the review.
  • Each piece of review data can be input into a pre-trained semantic model, and the semantic label of each piece of review data is determined based on the output of the semantic model. Then, the second feature of the store to be classified is determined according to these semantic tags.
  • Each word in the review data may first be represented as a word vector by an unsupervised word vector model (such as the word2vec model); a comment vector corresponding to the review data is then determined based on the word vectors; the comment vector is input into the semantic model to obtain an output result of the semantic model; and a semantic label is added to the review data according to the output result.
  • When the semantic tags corresponding to the store to be classified include a tag with closing semantics, the second feature of the store to be classified is determined to include the semantics that the store is not a real store.
  • A number threshold may also be set. When the number of pieces of comment data tagged with closing semantics exceeds the number threshold, the second feature of the store to be classified is determined to include the semantics that the store is not a real store.
  • The features of the store to be classified may also include, but are not limited to, at least one of the following: the review count feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, the consumer scoring feature, and so on.
  • Step 53: input the features of the store to be classified into the classification model to obtain an output result of the classification model.
  • the output of the classification model can be a numerical value or a classification label.
  • the classification label may include a real store label and a non-real store label.
  • The features of the store to be classified extracted from the store information 41 are input to the input layer 42, where the features include the first feature 421 and the second feature 422 extracted from the review information 411. After passing through the intermediate layer 44, an output result is obtained from the output layer 43.
  • Step 54: determine, according to the output result, whether the store to be classified is currently a real store.
  • When the output result is a classification label, whether the store to be classified is a real store is determined directly from the classification label: a store to be classified with a real store label is a real store; otherwise it is a non-real store.
  • When the output result is a numerical value, the classification label indicating whether the store to be classified is a real store is determined from that value.
  • The classification label of the store to be classified can be determined according to which end of the range the value is closer to.
  • It can also be determined according to a set threshold. For example, if the threshold for being biased toward 1 is set to 0.6, all values greater than 0.6 are treated as biased toward 1 and can correspond to the non-real store label.
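  • The threshold rule in this example can be written as the short sketch below. The label strings and the function name are illustrative; the 0.6 threshold and the convention that values biased toward 1 correspond to non-real stores come from the example in the text.

```python
REAL_STORE = "real store"
NON_REAL_STORE = "non-real store"

def label_from_output(value, threshold=0.6):
    """Map a numerical model output to a classification label.

    Values greater than the threshold are treated as biased toward 1 and
    are given the non-real store label; all other values are given the
    real store label.
    """
    return NON_REAL_STORE if value > threshold else REAL_STORE
```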
  • The method for classifying a store uses a classification model trained as in the embodiment of FIG. 2. Therefore, the description of the store samples in the embodiment shown in FIG. 2 also applies to the corresponding content of the store to be classified in the embodiment shown in FIG. 5, and details are not repeated here.
  • FIG. 6 shows a schematic block diagram of a training apparatus for a classification model according to an embodiment.
  • The apparatus 600 for training a classification model includes: a selection unit 61 configured to select a predetermined number of store samples, where
  • the store samples correspond to store information and classification labels,
  • the classification labels include real store labels and non-real store labels, and
  • the store information includes review information;
  • an extraction unit 62 configured to extract features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on the semantic description, included in the review information, that relates to the authenticity of the store; and
  • a training unit 63 configured to train the classification model based on the features and classification labels of the store samples.
  • the store sample may include a positive sample and a negative sample, where the positive sample corresponds to a real store label and the negative sample corresponds to a non-real store label.
  • The selection unit 61 may be configured to select, as positive samples, stores that have exhibited at least one of the following behaviors within a predetermined period: selling vouchers, group purchase activities, promotional activities, reservation services, Q & A interactions, advertisement placement, and receiving customer check-ins on the client.
  • the selecting unit 61 may be further configured to: select a store that meets the following conditions as a negative sample: it is marked as permanently closed on the electronic map.
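  • The two selection rules can be sketched as a single labelling function. The behavior names, the record layout, and the choice to let the "permanently closed" mark take precedence when both rules apply are assumptions of this example; the numeric labels follow the 0 / 1 convention used earlier for real and non-real stores.

```python
# Hypothetical identifiers for the behaviors listed in the text.
ACTIVE_BEHAVIORS = frozenset({
    "voucher_sale", "group_purchase", "promotion", "reservation",
    "qa_interaction", "advertisement", "customer_check_in",
})
REAL_STORE, NON_REAL_STORE = 0, 1

def select_sample_label(store):
    """Return a classification label for a candidate store sample.

    Returns NON_REAL_STORE for a negative sample, REAL_STORE for a positive
    sample, and None when the store qualifies as neither.
    """
    # Assumption: a permanent-closure mark overrides recent activity.
    if store.get("marked_permanently_closed", False):
        return NON_REAL_STORE
    if ACTIVE_BEHAVIORS & set(store.get("recent_behaviors", ())):
        return REAL_STORE
    return None
```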
  • The first feature may include one or more of the following: the time of the latest review, the time elapsed since the latest review, and the increase in the number of reviews within a predetermined time period.
  • The extraction unit 62 may further include: a review information acquisition module configured to acquire the first review information of the first store sample; a semantic label determination module configured to use a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information, where the semantic tags indicate having closing semantics or not having closing semantics; and a second feature determination module configured to determine the second feature of the first store sample according to the semantic tags. It is worth noting that the "first" and "second" in the "first feature" and "second feature" are only used to distinguish between two different features, and do not indicate a sequence limitation.
  • The second feature determination module may be further configured to: when the semantic tags corresponding to the first store sample include a tag with closing semantics, determine that the second feature of the first store sample includes the semantics that the store is not a real store.
  • the "first” in the "first store sample” and “first review information” referred to here means “some”, “one”, “any”, and the corresponding relationship between the store sample and the review information, It does not indicate order or distinction between store samples.
  • The semantic label determination module may be further configured to: for first review data in the first review information, represent each word in the first review data as a word vector through an unsupervised word vector model; determine, based on the word vectors, a first review vector corresponding to the first review data; input the first review vector into the semantic model to obtain an output result of the semantic model; and add a semantic label to the first review data according to the output result.
  • the above-mentioned features may further include, but are not limited to, at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • The store samples include training samples and test samples, and
  • the training unit 63 may include: a training module configured to take the features of each training sample as input and adjust the parameters of the classification model according to the comparison between the output result of the classification model and the classification label, so as to train the classification model; a test module configured to input the features of each test sample into the classification model trained with the training samples, and to check the accuracy of each output result of the classification model against the classification label corresponding to the test sample, so as to obtain the detection result of the classification model; and an adjustment module configured to adjust the classification model according to the detection result if the detection result does not satisfy a preset condition, for example by adjusting the grid structure of the classification model or by changing to a different classification model.
  • the preset condition here may be an evaluation parameter condition for the classification model.
  • The model evaluation parameters may include the area under the curve (AUC), accuracy, recall rate, F1 score, and so on.
  • the apparatus 600 shown in FIG. 6 corresponds to the method shown in FIG. 2. Therefore, the related description in FIG. 2 is also applicable to the apparatus 600, and details are not described herein again.
  • a device for classifying a store is also provided.
  • Fig. 7 shows a schematic block diagram for a store classification device according to one embodiment.
  • The apparatus 700 for classifying a store includes: an obtaining unit 71 configured to obtain store information of a store to be classified, where the store information includes review information; an extracting unit 72 configured to extract features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based at least on the time-related attributes of the review information, and the second feature is determined based on the semantic description, included in the review information, that relates to the authenticity of the store;
  • a classification unit 73 configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determination unit 74 configured to determine, based on the output result, whether the store to be classified is currently a real store.
  • The first feature may include one or more of the following: the time of the latest review, the time elapsed since the latest review, and the increase in the number of reviews within a predetermined time period.
  • The second feature may be extracted by: obtaining first review information of a first store sample; and using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where
  • the semantic labels indicate having closing semantics or not having closing semantics; the second feature of the first store sample is then determined according to the semantic labels.
  • When the semantic labels include a label with closing semantics, the second feature of the first store sample is determined to include the semantics that the store is not a real store.
  • Using a pre-trained semantic model to determine the semantic label of each piece of review data in the review information includes: for first review data in the first review information, representing each word in the first review data as a word vector through an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.
  • the above features may further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • the Internet data can be fully utilized to extract effective classification features, thereby improving the effectiveness of store classification.
  • The apparatus 700 shown in FIG. 7 corresponds to the method shown in FIG. 5. Therefore, the related description of FIG. 5 is also applicable to the apparatus 700, and details are not repeated here.
  • A computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2 or FIG. 5.
  • A computing device is also provided, which includes a memory and a processor.
  • The memory stores executable code.
  • When the processor executes the executable code, the method described in conjunction with FIG. 2 or FIG. 5 is implemented.
  • the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in or transmitted over as one or more instructions or code on a computer-readable medium.

Abstract

A classification model training method and a store classification method and device. During the training of a classification model, store information corresponding to a selected store sample comprises review information. The store information is used to extract features of the store sample, the features comprising: a first feature obtained at least on the basis of a time-related attribute of the review information; and a second feature determined on the basis of a semantic description comprised in the review information and related to the existence or non-existence of the store. When the trained classification model is used to perform store classification, features extracted from stores to be classified also comprise the first feature and the second feature. In this way, internet data can be fully utilized to improve the effectiveness of store classification.

Description

Classification model training method, store classification method and device
Technical field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a method for training a classification model by computer and a method and device for store classification.
Background
With the development of computer and Internet technologies, people come into contact with more and more network platforms and applications in daily life, such as dating applications, shopping applications, food-ordering applications, map applications, and so on. When users use applications that can recommend stores (such as food-ordering applications and map applications), how these applications describe the business status of a store (such as whether it has closed) is very important. For example, if a user who wants to eat malatang finds a nearby malatang store on the map, walks there following the map, and then discovers that the store has closed, the user will have a bad experience.
Therefore, it is necessary to make full use of Internet data, extract effective training features, and train a classification model with high accuracy to determine which stores have closed, thereby improving the effectiveness of store classification.
Summary of the invention
One or more embodiments of this specification describe a method and device that make full use of Internet data, extract effective training features, and train a classification model with high accuracy, so that stores that have closed can be accurately identified during store classification, thereby improving the effectiveness of store classification.
According to a first aspect, a training method for a classification model is provided, the classification model being used to determine whether a store is currently a real store. The method includes: selecting a predetermined number of store samples, where each store sample corresponds to store information and a classification label, the classification label is either a real store label or a non-real store label, and the store information includes review information; extracting features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on a semantic description, contained in the review information, that relates to the authenticity of the store; and training the classification model based on the features and the classification labels of the store samples.
In one embodiment, selecting a predetermined number of store samples includes: selecting, as positive samples, stores that have exhibited at least one of the following behaviors within a predetermined period: selling vouchers, group purchase activities, promotional activities, reservation services, Q & A interactions, advertisement placement, and receiving customer check-ins on the client, where the positive samples correspond to the real store label.
In one embodiment, selecting a predetermined number of store samples includes: selecting, as negative samples, stores that meet the following condition: being marked as permanently closed on an electronic map, where the negative samples correspond to the non-real store label.
In a possible embodiment, the first feature includes one or more of the following: the time of the latest review, the time elapsed since the latest review, and the increase in the number of reviews within a predetermined time period.
According to a possible design, the second feature is extracted by: obtaining first review information corresponding to a first store sample; using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where the semantic labels indicate having closing semantics or not having closing semantics; and determining the second feature of the first store sample according to the semantic labels.
Further, in one implementation, determining the second feature of the first store sample according to the semantic labels includes: when the semantic labels include a label with closing semantics, determining that the second feature of the first store sample includes the semantics that the store is not a real store.
In one embodiment, the semantic model includes a supervised model trained on a labeled review data set.
In a possible embodiment, using the pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information includes: for first review data in the first review information, representing each word in the first review data as a word vector by means of an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.
In one implementation, the features further include at least one of the following: a review count feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
According to a possible embodiment, the store samples further include test samples, and the method further includes: checking the accuracy of each output result of the classification model for each test sample, so as to obtain a test result for the classification model based on the accuracy of the output results; and adjusting the classification model according to the test result until the test result satisfies a preset condition.
According to a second aspect, a store classification method is provided, which uses a classification model trained by any method of the first aspect to determine whether a store is currently a real store. The method includes: obtaining store information of a store to be classified, where the store information includes review information; extracting features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store; inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; and determining, according to the output result, whether the store to be classified is currently a real store.
According to a third aspect, a training device for a classification model is provided, the classification model being used to determine whether a store is currently a real store. The device includes: a selection unit configured to select a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information; an extraction unit configured to extract features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store; and a training unit configured to train the classification model based on the features and the classification labels of the store samples.
According to a fourth aspect, a store classification device is provided, which uses a classification model trained by the training device of the third aspect to determine whether a store is currently a real store. The device includes: an obtaining unit configured to obtain store information corresponding to a store to be classified, where the store information includes review information; an extraction unit configured to extract features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store; a classification unit configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determining unit configured to determine, according to the output result, whether the store to be classified is currently a real store.
According to a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect or the second aspect.
According to a sixth aspect, a computing device is provided, including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.
With the method and device provided in the embodiments of this specification, when the classification model is trained, the store information corresponding to the selected store samples includes review information, and the features extracted from the store information include a first feature obtained based at least on time-related attributes of the review information and a second feature determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store. In this way, Internet data can be fully utilized to extract effective training features and to train a classification model with higher accuracy. When the trained classification model is used to classify stores, the features extracted for a store to be classified likewise include the first feature and the second feature, so that Internet data can be fully utilized to improve the accuracy, and in turn the usefulness, of store classification.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
FIG. 2 shows a flowchart of a training method for a classification model according to an embodiment;
FIG. 3 shows a specific example of extracting the second feature;
FIG. 4 shows a specific example of the model training process;
FIG. 5 shows a flowchart of a store classification method according to an embodiment;
FIG. 6 shows a schematic block diagram of a training device for a classification model according to an embodiment;
FIG. 7 shows a schematic block diagram of a store classification device according to an embodiment.
DETAILED DESCRIPTION
The solutions provided in this specification are described below with reference to the drawings.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. As shown in the figure, a user can view store information through a client application, such as a map application, a shopping application, or a food-ordering application. The client application can run on various terminal devices with data processing capability, such as smartphones, tablet computers, desktop computers, and smart watches. The store information displayed in the client application is provided by a server. The server may be a processing device with a certain data processing capability, or a cluster of processing devices. A computing platform trains a classification model; the server uses the classification model to classify stores, determines whether each store is a real store, and presents the result to the user through the client application. It can be understood that "real" here means that the store genuinely exists and has not permanently closed, gone bankrupt, or the like; a brief suspension of business (for example, two days) does not make a store non-real.
It is worth noting that the computing platform may be located in the server or may be a processing device independent of the server; this application does not limit this. The classification model trained by the computing platform can be reused by the server, and the results of the server's classification of stores using the classification model can also be reused.
The computing platform may first select a predetermined number of store samples, extract features from the store samples, and then train the classification model based on the extracted features and the known classification results. The store information corresponding to the selected store samples may include review information, so that the review information can be used during feature extraction: a first feature is obtained based at least on time-related attributes of the review information, and a second feature is determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store. In this way, Internet data can be fully utilized to extract effective training features and to train a classification model with higher accuracy.
Using the classification model trained by the computing platform, the server can classify stores to be classified. The server may first obtain the store information corresponding to a store to be classified, where the store information includes review information; then extract features of the store to be classified based on the store information and input them into the classification model trained by the computing platform to obtain an output result of the classification model; and determine, according to the output result, whether the store to be classified is currently a real store. Correspondingly, the features that the server extracts for the store to be classified also include the first feature and the second feature extracted from the review information. In this way, Internet data can be fully utilized to extract effective features and to improve the accuracy of store classification, making the classification results more useful.
When a user views store information through a client application, such as a map application, a shopping application, or a food-ordering application, the store information sent by the server to the client may include only the store information of stores that have not closed, or may include the store information of all stores. When the store information sent by the server to the client covers all stores, it may also include information on whether each store has closed.
It is worth noting that FIG. 1 shows only one specific implementation scenario of an embodiment disclosed in this specification and does not limit the range of implementation scenarios of the embodiments. For example, another implementation scenario may omit the client shown in FIG. 1.
The specific execution process of the above scenario is described below.
FIG. 2 shows a flowchart of a training method for a classification model according to an embodiment. The method may be executed by any system, equipment, device, platform, or server with a certain computing and data processing capability, such as the computing platform shown in FIG. 1. The classification model involved in the method can be used to determine whether a store is currently a real store.
As shown in FIG. 2, the method includes the following steps: step 21, selecting a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information; step 22, extracting features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store; and step 23, training the classification model based on the features and classification labels of the store samples.
First, in step 21, a predetermined number of store samples are selected, where the store samples correspond to store information and classification labels. Here, the classification labels include a real store label and a non-real store label. It can be understood that user reviews usually reflect impressions that users form through direct, first-hand experience of a store, so the review information of real stores and non-real stores may differ markedly; for example, a non-real store may have no reviews or only a few. Review information may therefore have a large influence on judging a store's classification. Accordingly, the store information corresponding to a store sample may include at least review information, and the review information may include review content, review time, the number of reviews, and so on.
In one embodiment, store information can be crawled from a predetermined website (for example, a ×× review site) by a web crawler (for example, one written in Python). For example, the user registration information or the content publishing information of the predetermined website can be crawled. Store information can then be obtained from the registered user type (such as store or consumer) in the user registration information, the type of published content (such as for sale or wanted) in the content publishing information, and the like. If the published content is sale information, the user who published it may be a store, from which the store name, store location, review information, and so on can be obtained. In practice, a search can also be performed on an electronic map based on information such as the store name and store location to determine the classification label of the store; for example, a store that cannot be found on the electronic map is a non-real store.
In another embodiment, store samples can also be collected manually offline, for example, by manually verifying on site, one by one, the store addresses listed on a website or map to determine their classification labels. At the same time, the store information of the corresponding stores can be obtained through at least one of telephone calls, search engines, registration information held by administrative authorities, and the like. The review information in the store information can be obtained, for example, by telephone or from "Q&A" entries in a search engine.
In further embodiments, store samples with known classification labels can also be obtained through a wider range of acquisition channels, which are not described one by one here.
It can be understood that the obtained store samples need to be screened preliminarily, and a predetermined number of store samples are selected from them. The store samples can include positive samples and negative samples, where a positive sample corresponds to the real store label and a negative sample corresponds to the non-real store label.
In a possible embodiment, stores that exhibit at least one of the following behaviors within a predetermined period (such as one month) can be selected as positive samples: selling vouchers, group-purchase activities, promotional activities (such as discounts), reservation services, question-and-answer interaction, advertising placement, receiving customer check-ins at the client, and so on. In practice, stores may use various sales methods in their operations, such as selling vouchers, organizing group-purchase activities, and organizing promotional activities. Some stores (such as hotels and restaurants) provide reservation services; some carry out question-and-answer interaction with consumers or potential consumers on relevant websites (such as travel guide websites); and some cooperate with websites to place advertisements in order to increase page views or search rankings. In addition, some stores can receive customers' in-store check-ins through the client of an application (such as a food review website): if a customer clicks the check-in control on the store page in the client, the check-in succeeds when the deviation between the check-in location and the store location is within a set distance (such as 80 meters). Generally, a store that offers check-in is likely to be a real store, since customers check in when they visit the store to make purchases. Therefore, a store that exhibits one of the above behaviors currently or within the predetermined period can be determined to be a positive sample, and the real store label is assigned to these store samples.
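The check-in condition described above can be illustrated with a minimal sketch. The great-circle (haversine) distance, the function names, and the coordinate format are assumptions made for illustration; the specification only states that the deviation between the check-in location and the store location must fall within a set distance such as 80 meters.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def check_in_succeeds(store_pos, customer_pos, max_distance_m=80.0):
    """Hypothetical helper: a check-in succeeds when the customer is close enough."""
    return haversine_m(*store_pos, *customer_pos) <= max_distance_m
```

A check-in roughly 55 meters from the store would succeed under the 80-meter example, while one roughly 110 meters away would not.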
In a possible embodiment, stores that meet the following condition can be selected as negative samples: being marked as permanently closed on an electronic map. In some map applications, when a store closes permanently, it is deleted from the map or marked as permanently closed. Therefore, a search can be performed using the store name and store location; stores that an electronic map application marks as permanently closed are used as negative samples after the store location has been confirmed on the electronic map, and the non-real store label is assigned to these store samples.
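The positive and negative selection rules can be sketched as a single labeling function. The dictionary keys, the behavior names, and the choice to let a "permanently closed" mark take precedence over recent behaviors are assumptions for illustration; the specification does not state how a conflict between the two rules would be resolved.

```python
# Hypothetical behavior identifiers corresponding to the activities listed above.
POSITIVE_BEHAVIORS = {
    "voucher_sale", "group_purchase", "promotion", "reservation",
    "qa_interaction", "advertising", "check_in",
}

REAL_STORE = 1      # real store label
NON_REAL_STORE = 0  # non-real store label


def assign_label(store):
    """Return a classification label, or None if the store is not selected as a sample."""
    # Assumption: a permanent-closure mark on the electronic map wins over behaviors.
    if store.get("marked_permanently_closed", False):
        return NON_REAL_STORE
    if POSITIVE_BEHAVIORS & set(store.get("recent_behaviors", ())):
        return REAL_STORE
    return None
```

Stores for which `assign_label` returns `None` satisfy neither rule and would simply be left out of the sample set in this sketch.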
When the store samples are obtained, the store information corresponding to the store samples can be obtained as well. In addition to the aforementioned review information, the store information may include, for example, the store name and store address. In some embodiments, the store information may further include, but is not limited to, at least one of the following: basic store information, such as telephone number, business hours, and whether a wireless network connection (such as Wi-Fi) is provided; the store brand name, such as ×× Steamed Bun Shop; store tags assigned by a website or an administrative supervision department, such as "overseas food selection" or "recommended by the local tourism bureau"; and the store category, such as food, shopping, or hotel.
It can be understood that non-real stores are stores that have permanently closed, and there are usually fewer of them than real stores. According to a possible design, the obtained store samples with the real store label can be downsampled so that the number of store samples with the real store label and the number of store samples with the non-real store label are approximately equal, for example, 45,000 each.
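The downsampling step might look like the following minimal sketch. Uniform random sampling without replacement and the fixed seed are assumptions; the specification only requires that the two classes end up approximately equal in number.

```python
import random


def downsample_positives(positives, negatives, seed=0):
    """Randomly drop positive samples until both classes have the same size."""
    if len(positives) <= len(negatives):
        return list(positives), list(negatives)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(list(positives), len(negatives)), list(negatives)
```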
Next, in step 22, features of the store samples are extracted based on the store information. In this embodiment, the features include at least a first feature and a second feature: the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on semantic descriptions, contained in the review information, that relate to the authenticity of the store. It is worth noting that "first" and "second" in "first feature" and "second feature" serve only to distinguish two different features and do not imply any order.
The time-related attributes of the review information may include, but are not limited to, at least one of the following: the time at which a review was posted (such as May 1, 2018), the time elapsed between the review and the current time (such as 10 hours or 20 days), and the number of reviews (such as 100) within a predetermined time period (such as 2 days). It can be understood that a real store is likely to keep attracting new consumers who make purchases and post reviews, so its latest review tends to be recent, the time elapsed since that review is short, and the number of reviews is likely to increase within a predetermined time period. A non-real store no longer has new consumers, so its reviews are older, the time elapsed since the latest review is long, and the number of reviews is unlikely to increase within a predetermined time period.
Accordingly, the first feature may include, but is not limited to, one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews within a predetermined time period. Here, the latest review time is the time of the review closest to the current time. For example, if the review information of a store sample contains no further reviews after one posted at 20:00 on March 2, 2015, then the latest review time of that store sample is 20:00 on March 2, 2015. The time elapsed since the latest review is the difference between the current time and the latest review time, such as 30 days. The increment in the number of reviews within a predetermined time period is the change in the total number of reviews over each interval of that length. For example, assuming the predetermined time period is 3 months, the total number of reviews is counted, by review time, for every 3-month interval back from the current time, and the increment is calculated; if 1,000 reviews were posted in the most recent 3 months, the review increment for the most recent 3 months is 1,000. In this way, the time-related attribute data of the store samples' review information on the Internet can be fully utilized.
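The three first-feature quantities can be computed from a list of review timestamps, as in the sketch below. The function name, the returned dictionary keys, the use of whole days for the elapsed time, and the 90-day default window (standing in for the 3-month example) are illustrative assumptions.

```python
from datetime import datetime, timedelta


def first_features(review_times, now, window=timedelta(days=90)):
    """Compute time-related review features for one store sample."""
    if not review_times:
        # Assumption: a store with no reviews gets empty time features.
        return {"latest_review_time": None, "days_since_latest": None, "recent_increment": 0}
    latest = max(review_times)
    # Reviews posted within the most recent window count toward the increment.
    recent = sum(1 for t in review_times if now - window < t <= now)
    return {
        "latest_review_time": latest,
        "days_since_latest": (now - latest).days,
        "recent_increment": recent,
    }
```

In practice the latest review time would probably be converted to a numeric value (for example, a timestamp) before being passed to a model; that conversion is omitted here.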
The semantic descriptions, contained in the review information, that relate to the authenticity of the store may be descriptions whose semantics indicate that the store has closed or that it is operating well. For example, "this store has shut down and no longer exists" may be a description whose semantics indicate that the store has permanently closed. The same review sentence may also express different meanings depending on information such as its posting time. For example, for a restaurant, the review "came all this way, and it was already closed" probably means that the restaurant had closed for the night if it was posted at midnight, but probably means that the restaurant has gone out of business if it was posted at noon. Moreover, for a given store, even a very small number of reviews (such as one) whose semantics express that the store has closed may indicate that the store has permanently closed. Therefore, the features may include a second feature that reflects whether the review information contains a semantic description of the store being permanently closed.
The second feature can be expressed in words, for example: "has a semantic description of the store being permanently closed" or "contains a semantic description related to the authenticity of the store", and "does not have a semantic description of the store being permanently closed" or "does not contain a semantic description related to the authenticity of the store". The second feature can also be expressed as a numerical value; for example, the second feature is 1 when a semantic description of the store being permanently closed is present and 0 when it is absent.
As shown in FIG. 3, according to a possible design, the second feature can be extracted as follows: step 31, obtaining the first review information corresponding to a first store sample; step 32, using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where the semantic label is either "has closure semantics" or "does not have closure semantics"; and step 33, determining the second feature of the first store sample according to the semantic labels. It is worth noting that "first" in "first store sample" and "first review information" means "a certain", "one of", or "any one", and indicates the correspondence between the store sample and the review information; it does not indicate an order or a distinction between store samples.
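Steps 31 to 33 can be sketched as follows, combining the numerical encoding (1 or 0) with the rule that a single review with closure semantics is enough to set the feature. The `keyword_tagger` function is only a stand-in for the pre-trained semantic model and is not the model described in this specification; all names are hypothetical.

```python
CLOSED = "closed"          # label: has closure semantics
NOT_CLOSED = "not_closed"  # label: does not have closure semantics


def second_feature(reviews, tag_review):
    """Return 1 if any review of the store sample carries closure semantics, else 0."""
    tags = [tag_review(text) for text in reviews]  # step 32: one label per review
    return 1 if CLOSED in tags else 0              # step 33: aggregate the labels


def keyword_tagger(review_text):
    """Stand-in for the semantic model: a naive keyword match, for illustration only."""
    return CLOSED if "permanently closed" in review_text.lower() else NOT_CLOSED
```

Because `tag_review` is passed in as a callable, a trained semantic model can replace the keyword stand-in without changing the aggregation logic.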
For any store sample, in step 31, the review information of the store sample can be obtained first. The review information of a store sample may correspond to one or more pieces of review data, and each piece of review data may include the content and time of one review, as well as data such as the ID of the user who posted the review.
Next, in step 32, the pre-trained semantic model is used to determine the semantic label corresponding to each piece of review data in the review information. It can be understood that each piece of review data can correspond to one semantic label. Each piece of review data can be input into the pre-trained semantic model, and the semantic label of that piece of review data is determined according to the output of the semantic model. The semantic model can be trained on a review set that has been labeled in advance.
As an example, reviews can be selected from the review data of multiple store samples and added to the review set, giving priority to review data containing phrases such as "关门" ("closed down") or "停业" ("out of business"). The semantic labels of these reviews are determined by manual identification and annotation and serve as known semantic labels for training a supervised model, for example a logistic regression (LR) model. Model training is the process of determining model parameters from known inputs (such as review sentences) and known outputs (such as the known semantic labels), and is not described further here. The semantic label of a piece of review data is either "has closure semantics" or "does not have closure semantics".
The output of the semantic model may be one of the semantic labels directly, or a numerical value. In one case, the output is one of two possible values (such as 1 and 0), each corresponding to one semantic label; for example, 1 corresponds to the "has closure semantics" label. In another case, the output is one of many possible values (such as any decimal between 0 and 1), and a threshold is set to decide which semantic label the output leans toward; for example, an output greater than 0.6 leans toward the "has closure semantics" label.
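The three output forms described above can be mapped to a semantic label as in the following minimal sketch. The function name `label_from_output` and the label strings are hypothetical; the 0.6 threshold is the example value from the text, and the assumption that larger values lean toward closure semantics follows that example.

```python
HAS_CLOSURE = "has closure semantics"
NO_CLOSURE = "does not have closure semantics"


def label_from_output(output, threshold=0.6):
    """Map a semantic-model output to one of the two semantic labels."""
    # Case 1: the model already returns a label.
    if isinstance(output, str):
        if output not in (HAS_CLOSURE, NO_CLOSURE):
            raise ValueError(f"unknown label: {output!r}")
        return output
    # Cases 2 and 3: a binary value (1 or 0) or a score between 0 and 1.
    # Assumption: larger values lean toward closure semantics.
    return HAS_CLOSURE if output > threshold else NO_CLOSURE
```

With this sketch, a binary output of 1 and a score of 0.75 both yield the "has closure semantics" label, while a score of exactly 0.6 does not, because the text requires the value to be greater than the threshold.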
According to one implementation, for each piece of review data in the review information: first, an unsupervised word vector model (such as word2vec) represents each word in the review data as a word vector; a review vector for the review data is then determined from these word vectors; the review vector is input into the semantic model to obtain its output; and a semantic label is added to the review data according to that output. The review vector may be obtained, for example, by averaging the word vectors dimension by dimension, or by taking a weighted average dimension by dimension.
For example, for the review data "该店已经关门大吉不存在了" ("this store has closed down for good and no longer exists"), word segmentation and function-word filtering can first be applied to obtain the words "该店" ("this store"), "关门大吉" ("closed down for good"), and "不存在" ("does not exist"). Assuming the word vector model has three dimensions a, b, and c, the words are represented as the following word vectors:
Figure PCTCN2019080022-appb-000001
Figure PCTCN2019080022-appb-000002
In one implementation, the review vector determined from these word vectors may be:
Figure PCTCN2019080022-appb-000003
In another implementation, the number of occurrences of each word can be used as its weight, and the review vector is obtained as the weighted average of the word vectors in each dimension:
Figure PCTCN2019080022-appb-000004
Here, the 1 in front of each parameter is the number of occurrences of the corresponding word, and the denominator is the sum of the occurrence counts of all words. In this example every word occurs once; in practice the counts may take other values.
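The two averaging schemes can be illustrated with a short sketch. The numeric word vectors below are hypothetical placeholders (the values in the original figures are not reproduced here), and `review_vector` is an illustrative helper name, not part of the specification.

```python
from collections import Counter


def review_vector(words, word_vectors, weighted=False):
    """Average word vectors dimension by dimension into one review vector.

    words: segmented words of one review (repeats allowed).
    word_vectors: dict mapping each word to a list of floats of equal length.
    weighted: if True, weight each word by its number of occurrences.
    """
    counts = Counter(words)
    if not counts:
        raise ValueError("the review contains no words")
    dims = len(word_vectors[next(iter(counts))])
    sums = [0.0] * dims
    weight_total = 0
    for word, count in counts.items():
        weight = count if weighted else 1
        weight_total += weight
        for i, value in enumerate(word_vectors[word]):
            sums[i] += weight * value
    return [s / weight_total for s in sums]


# Hypothetical 3-dimensional vectors (a, b, c) for the three words of the example.
vectors = {
    "该店": [1.0, 2.0, 3.0],
    "关门大吉": [4.0, 5.0, 6.0],
    "不存在": [7.0, 8.0, 9.0],
}
words = ["该店", "关门大吉", "不存在"]
plain = review_vector(words, vectors)
by_count = review_vector(words, vectors, weighted=True)
```

Because every word occurs once in this example, the plain and weighted averages coincide, matching the remark in the text; they differ only when some word is repeated.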
Further, the review vector can be input into the semantic model to obtain the model's output. The review vector can also be expressed as
Figure PCTCN2019080022-appb-000005
where each component is input into the semantic model as one feature. A semantic label is then added to the review data according to the model's output; for example, if the output is 1, the label "has closure semantics" is added.
In this way, a semantic label can be added to every piece of review data in the review information of a store sample.
Step 33 determines the second feature of a store sample from the semantic labels corresponding to that store sample. The second feature may take values such as "has a semantic description of permanent store closure" (or "contains a semantic description related to store authenticity") and "does not have a semantic description of permanent store closure" (or "does not contain a semantic description related to store authenticity"), or numerical values such as 1 and 0.
Further, in one embodiment, if any one of the semantic labels corresponding to the first store sample is a "has closure semantics" label, the second feature of that store sample is determined to be: contains semantics indicating that the store is a non-real store.
In some special cases, for example a user venting by posting "这店早该关门了" ("this store should have closed long ago"), a review may also be given the "has closure semantics" label. Therefore, in another embodiment, a count threshold can be set, and the second feature of the store sample is determined to contain the semantics that the store is a non-real store only when the number of pieces of review data labeled "has closure semantics" exceeds that threshold (for example, 10).
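Both embodiments of step 33 can be sketched with one helper. The name `second_feature`, the label string, and the choice of encoding "contains semantics that the store is non-real" as 1 are assumptions made for illustration; the text only states that numerical values such as 1 and 0 may be used.

```python
HAS_CLOSURE = "has closure semantics"


def second_feature(labels, count_threshold=0):
    """Return 1 if the review labels indicate a non-real store, else 0.

    count_threshold=0 corresponds to the first embodiment (any single
    closure label suffices); a larger value corresponds to the embodiment
    in which the closure-label count must exceed a threshold.
    """
    closure_count = sum(1 for label in labels if label == HAS_CLOSURE)
    return 1 if closure_count > count_threshold else 0
```

With `count_threshold=10`, ten closure-labeled reviews are not enough and eleven are, since the count must exceed the threshold.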
In this way, the semantic description data related to store authenticity in the online review information of store samples can be fully utilized.
In one embodiment, besides the first and second features, the features of a store sample may also include review quantity features, such as the total number of reviews, the ratio among the numbers of positive, neutral, and negative reviews, and the number of pictures in the reviews. A store with a large share of negative reviews is more likely to be a non-real store; a store with many reviews, or with many pictures in its reviews, is more likely to be a real store. The review quantity features can therefore serve as a factor in classifying whether a store is currently a real store.
In one embodiment, the features of a store sample may also include a basic information completeness feature. Basic information includes, for example, the telephone number, business hours, whether a wireless network connection (such as Wi-Fi) is provided, and service facilities. The more complete a store's basic information, the more likely the store is a real store. Optionally, the completeness may be proportional to the number of basic information items provided. The basic information completeness feature can therefore serve as a factor in classifying whether a store is currently a real store.
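One way to make completeness proportional to the number of basic information items is to take the fraction of items that are filled in. The field names below are hypothetical, and the fraction is only one possible proportional measure.

```python
# Hypothetical basic information items, following the examples in the text.
BASIC_FIELDS = ("phone", "business_hours", "wifi", "facilities")


def completeness(info, fields=BASIC_FIELDS):
    """Fraction of basic information items that are present for a store."""
    # A value of False (e.g. "no Wi-Fi") still counts as provided information.
    filled = sum(1 for name in fields if info.get(name) not in (None, ""))
    return filled / len(fields)
```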
In one embodiment, the features of a store sample may also include a predetermined identifier feature. A predetermined identifier may be, for example, a brand-store or chain-store designation, or a preferred label given by a website or an administrative agency (such as a label recommended by the local tourism bureau). Brand stores and chain stores tend to be well known and widely recognized by the market, so they are more likely to be real stores. Stores given a preferred label by a website or an administrative agency have usually passed review and inspection, so they too are more likely to be real stores. The predetermined identifier feature can therefore serve as a factor in classifying whether a store is currently a real store.
In one embodiment, the features of a store sample may also include a store business category feature. The business category may be, for example, food, hotel, or clothing. On some websites, food stores receive many reviews, so classifying by review count alone is less accurate. Stores in different business categories can therefore be treated differently, with stores in categories that receive fewer reviews given a greater weight.
In one embodiment, the features of a store sample may also include a consumer rating feature. A consumer rating may be a numerical score or a star rating. If the store samples come from the same website and the ratings follow the same standard, the consumer rating can be used directly as the consumer rating feature. If the store samples come from different websites, whose rating standards may differ, the ratio of the consumer rating to the full score can be used instead. Consumer ratings affect a store's customer traffic, and a store with very low traffic is more likely to become a non-real store; the consumer rating feature can therefore serve as a factor in classifying whether a store is currently a real store.
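The ratio-to-full-score normalization for ratings from different websites can be sketched as follows; `rating_feature` is a hypothetical helper name.

```python
def rating_feature(score, full_score):
    """Normalize a consumer rating to the ratio of the score to the full score."""
    if full_score <= 0:
        raise ValueError("full_score must be positive")
    return score / full_score
```

Under this sketch, 4 stars out of 5 on one site and 8 points out of 10 on another yield the same feature value, which is the purpose of the normalization.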
In further embodiments, the features of a store sample may include still other features, which are not enumerated here.
Step 23 trains the classification model based on the features and classification labels of the store samples. Model training is the process of determining model parameters from known input features and classification results. In this specification, the input features are the features of a store sample, which comprise multiple input features, and the classification result is determined by the store sample's classification label; for example, the output takes the values 0 and 1, where 0 denotes the real-store label and 1 denotes the non-real-store label. Each store sample corresponds to one set of known input features and one classification result.
As shown in FIG. 4, during training, the known input features fed into the input layer 42 are the features of each store sample, and the output of the output layer 43 is compared with the classification label of the corresponding store sample. Based on the comparison, the parameters of the intermediate layer 44 are adjusted, together with the weight parameters represented by the arrows between the features of the input layer 42 and the intermediate layer 44 and between the intermediate layer 44 and the output layer 43.
In FIG. 4, the known input features fed into the input layer 42 include a first feature 421 and a second feature 422, each obtained from data related to the review information 411 in the store information 41.
In one possible design, the store samples can be divided into training samples and test samples. During training, the features of each training sample are input in turn, and the classification parameters of the model are adjusted by comparing the model's output with the classification label, so that the output becomes more consistent with the label of the training sample currently input. Next, the features of each test sample are input into the classification model trained on the training samples, and the classification labels of the test samples are used to check the correctness of each output, yielding a test result for the classification model. For example, if a classification label and the model's output agree, the output is deemed correct. In this way an overall test result on the test samples, such as accuracy, can be obtained.
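The split into training and test samples and the label-agreement check can be sketched as below. The 20% test ratio, the fixed seed, and the helper names are assumptions; the specification does not state how the samples are divided.

```python
import random


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle the store samples and split them into (training, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Share of test samples whose model output agrees with the label."""
    if not actual:
        raise ValueError("no test samples")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```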
If the test result does not satisfy a predetermined condition, the classification model can be further adjusted according to the test result, for example by adjusting the structure of the classification model or switching to a different classification model. When the classification model is a gradient boosting decision tree (GBDT) model, for instance, the number of trees, the depth of each tree, and the learning rate can be tuned. After the adjustment, the classification model is retrained with the training samples and tested again with the test samples, until the test result on the test samples satisfies the preset condition.
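The adjust, retrain, and retest cycle can be sketched as a simple search over the three GBDT settings named in the text. This is only one way to organize the adjustment: `train_and_test` and `is_ok` are hypothetical callables supplied by the caller, and no particular GBDT library is assumed.

```python
from itertools import product


def tune_until_ok(train_and_test, tree_counts, depths, learning_rates, is_ok):
    """Try settings in order until the test result meets the preset condition.

    train_and_test(n_trees, depth, learning_rate) retrains the model on the
    training samples and returns its test result; is_ok(result) encodes the
    preset condition.
    """
    for n_trees, depth, lr in product(tree_counts, depths, learning_rates):
        result = train_and_test(n_trees, depth, lr)
        if is_ok(result):
            settings = {"n_trees": n_trees, "depth": depth, "learning_rate": lr}
            return settings, result
    # No candidate setting satisfied the preset condition.
    return None, None
```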
The preset condition here may be a condition set on the test result of the classification model. For example, when the classification model is a GBDT model, the test result may include the values of the area under the curve (AUC), precision, recall, and F1 score, and the preset condition may be that precision and recall are both greater than 0.7. In one experiment according to an embodiment of this specification, the model reached AUC = 0.868, precision = 0.767, recall = 0.803, and F1 = 0.784.
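Precision, recall, and F1 follow from the counts of true positives, false positives, and false negatives on the test samples. The sketch below assumes label 1 (non-real store) is the positive class, which the text does not state; the helper names are hypothetical.

```python
def precision_recall_f1(predicted, actual, positive=1):
    """Compute precision, recall and F1 for the chosen positive label."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_f1(precision, recall)
    return precision, recall, f1


def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def meets_preset_condition(precision, recall, minimum=0.7):
    """Example preset condition from the text: both values exceed 0.7."""
    return precision > minimum and recall > minimum
```

Applying `harmonic_f1` to the reported precision of 0.767 and recall of 0.803 gives approximately 0.7846, which agrees with the reported F1 of 0.784 up to rounding of the inputs.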
To recap, the store information corresponding to the selected store samples includes review information, so the features extracted from the store information can include at least a first feature obtained from time-related attributes of the review information and a second feature determined from the semantic descriptions related to store authenticity contained in the review information. Training the classification model on features that include the first and second features makes full use of Internet data and yields a classification model of higher accuracy, thereby improving the effectiveness of store classification.
According to an embodiment of another aspect, a store classification method is also provided, which uses a classification model to determine whether a store is currently a real store. The method is suitable for an electronic device with a certain data processing capability, such as the server in FIG. 1.
As shown in FIG. 5, the flow of the store classification method in one embodiment includes: step 51, obtain the store information of a store to be classified, where the store information includes review information; step 52, extract the features of the store to be classified from the store information, where the features include at least a first feature and a second feature, the first feature being obtained at least from time-related attributes of the review information and the second feature being determined from the semantic descriptions related to store authenticity contained in the review information; step 53, input the features of the store to be classified into the classification model to obtain the model's output; step 54, determine from the output whether the store to be classified is currently a real store.
First, in step 51, the store information of the store to be classified is obtained. The store information includes at least review information, such as review content, review time, and number of reviews. The store information may also include, but is not limited to, at least one of the following: basic store information, the store's brand name, a store label given by a website or an administrative supervision department, and the store category. The store information can be crawled from a predetermined website (for example, ××点评) using a web crawler (for example, one written in Python).
Next, in step 52, the features of the store to be classified are extracted from the store information. These features correspond to the input features of the classification model. They include at least a first feature and a second feature: the first feature is obtained at least from time-related attributes of the review information, and the second feature is determined from the semantic descriptions related to store authenticity contained in the review information. Note that "first" and "second" in "first feature" and "second feature" only distinguish two different features and do not imply an order.
The time-related attributes of the review information may include, but are not limited to, at least one of the following: the posting time of a review, the time elapsed from a review to the current time, and the number of reviews within a predetermined time period. Accordingly, the first feature may include, but is not limited to, one or more of the following: the time of the latest review, the time elapsed from the latest review to the current time, and the increase in the number of reviews within a predetermined time period. In this way, the time-related attribute data of a store's online review information can be fully utilized.
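The three first-feature candidates can be computed from review timestamps as in the sketch below. The 30-day window, the use of whole days as the time unit, and the dictionary keys are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def first_features(review_times, now, window_days=30):
    """Derive time-related features from the posting times of a store's reviews."""
    if not review_times:
        raise ValueError("the store has no reviews")
    latest = max(review_times)
    window_start = now - timedelta(days=window_days)
    # Reviews posted inside the window, i.e. the increase in review count.
    recent = sum(1 for t in review_times if window_start <= t <= now)
    return {
        "latest_review_time": latest,
        "days_since_latest": (now - latest).days,
        "recent_review_count": recent,
    }


times = [datetime(2019, 1, 1), datetime(2019, 3, 1), datetime(2019, 3, 20)]
features = first_features(times, now=datetime(2019, 3, 25))
```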
The semantic descriptions related to store authenticity contained in the review information may be descriptions whose semantics indicate that the store has closed or that it is operating well. For a given store, even a very small number of reviews (for example, one) expressing that the store has permanently closed may indicate that it has in fact permanently closed. The store can therefore be classified using a second feature that indicates whether the review information contains a semantic description of permanent closure. The second feature can be expressed in words or as a numerical value.
According to one possible design, the second feature can be extracted as follows: obtain the review information of the store to be classified; use a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the review information, where each semantic label is either "has closure semantics" or "does not have closure semantics"; and determine the second feature of the store to be classified from the semantic labels corresponding to that store.
The review information of a store to be classified may comprise one or more pieces of review data; each piece of review data may include the content and time of one review, and may also include data such as the ID of the user who posted it. Each piece of review data can be input into the pre-trained semantic model separately, and its semantic label is determined from the model's output. The second feature of the store to be classified is then determined from these semantic labels. According to one implementation, for each piece of review data in the review information: first, an unsupervised word vector model (such as word2vec) represents each word in the review data as a word vector; a review vector for the review data is then determined from these word vectors; the review vector is input into the semantic model to obtain its output; and a semantic label is added to the review data according to that output.
In one embodiment, if any one of the semantic labels corresponding to the store to be classified is a "has closure semantics" label, the second feature of the store to be classified is determined to be: contains semantics indicating that the store is a non-real store. In another embodiment, a count threshold can be set, and the second feature of the store to be classified is determined to contain the semantics that the store is a non-real store only when the number of pieces of review data labeled "has closure semantics" exceeds that threshold.
In this way, the semantic description data related to store authenticity in a store's online review information can be fully utilized.
In some possible designs, besides the first and second features, the features of the store to be classified may also include, but are not limited to, at least one of the following: a review quantity feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
Step 53 inputs the features of the store to be classified into the classification model to obtain the model's output. The output may be a numerical value or a classification label. When the output is a classification label, it is either the real-store label or the non-real-store label.
As shown in FIG. 4, the features of the store to be classified, extracted from the store information 41, are input into the input layer 42; these features include the first feature 421 and the second feature 422 extracted from the review information 411. After passing through the intermediate layer 44, the output is obtained from the output layer 43.
Step 54 determines from the output whether the store to be classified is currently a real store. When the output is a classification label, the label decides the result directly: a store to be classified that carries the real-store label is a real store; otherwise it is a non-real store. When the output is a numerical value that takes one of two values, for example only 1 or 0, the value is mapped to the corresponding classification label. When the output can take many values, for example any value between 0 and 1, the classification label is determined by which end the value leans toward. This can be decided with a preset threshold: for example, if the threshold for leaning toward 1 is set to 0.6, every value greater than 0.6 leans toward 1 and can be mapped to the non-real-store label.
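The decision rule of step 54 can be sketched as follows. The label strings and the function name `is_real_store` are hypothetical; the convention that 1 (or a value above the 0.6 threshold) maps to the non-real-store label follows the examples in the text.

```python
REAL_STORE = "real store"
NON_REAL_STORE = "non-real store"


def is_real_store(output, threshold=0.6):
    """Decide from the classification-model output whether the store is real."""
    # The output is already a classification label.
    if isinstance(output, str):
        if output not in (REAL_STORE, NON_REAL_STORE):
            raise ValueError(f"unknown label: {output!r}")
        return output == REAL_STORE
    # Binary value or score: values above the threshold lean toward 1,
    # which corresponds to the non-real-store label.
    return not output > threshold
```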
Note that the method embodiment shown in FIG. 5 classifies stores with the classification model trained according to the embodiment of FIG. 2. The descriptions of store samples in the embodiment of FIG. 2 therefore also apply to the stores to be classified in the embodiment of FIG. 5, and are not repeated here.
According to an embodiment of another aspect, a training apparatus for a classification model is also provided. FIG. 6 shows a schematic block diagram of a training apparatus for a classification model according to one embodiment. As shown in FIG. 6, the apparatus 600 for training a classification model includes: a selection unit 61, configured to select a predetermined number of store samples, where each store sample has corresponding store information and a classification label, the classification label is either the real-store label or the non-real-store label, and the store information includes review information; an extraction unit 62, configured to extract the features of the store samples from the store information, where the features include at least a first feature and a second feature, the first feature being obtained at least from time-related attributes of the review information and the second feature being determined from the semantic descriptions related to store authenticity contained in the review information; and a training unit 63, configured to train the classification model based on the features and classification labels of the store samples.
The store samples may include positive samples and negative samples, where positive samples carry the real-store label and negative samples carry the non-real-store label. Further, in one embodiment, the selection unit 61 may be configured to select as positive samples stores that have shown at least one of the following behaviors within a predetermined period: selling vouchers, group-buying activities, promotional activities, table reservation services, question-and-answer interaction, advertisement placement, and receiving customer check-ins from the client. In another embodiment, the selection unit 61 may also be configured to select as negative samples stores that satisfy the following condition: being marked as permanently closed on an electronic map.
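The selection rules for positive and negative samples can be sketched as a labeling function. The behavior identifiers, the dictionary keys of a store record, and the precedence given to the map marking are hypothetical; the labels 0 (real store) and 1 (non-real store) follow the example encoding given for step 23.

```python
# Hypothetical identifiers for the behaviors listed in the text.
POSITIVE_BEHAVIORS = {
    "voucher_sale", "group_buying", "promotion", "reservation",
    "qa_interaction", "advertising", "customer_check_in",
}


def sample_label(store):
    """Return 0 for a positive sample, 1 for a negative sample, else None.

    store: dict with optional keys "marked_permanently_closed" (bool) and
    "recent_behaviors" (behaviors observed within the predetermined period).
    """
    # Assumption: a permanent-closure marking takes precedence.
    if store.get("marked_permanently_closed"):
        return 1
    if POSITIVE_BEHAVIORS & set(store.get("recent_behaviors", ())):
        return 0
    # The store matches neither rule and is not selected as a sample.
    return None
```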
According to an embodiment of one aspect, the first feature may include one or more of the following: the time of the latest review, the time elapsed from the latest review to the current time, and the increase in the number of reviews within a predetermined time period.
According to an embodiment of another aspect, for extracting the second feature, the extraction unit 62 may further include: a review information acquisition module, configured to obtain the first review information of a first store sample; a semantic label determination module, configured to use a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where each semantic label is either "has closure semantics" or "does not have closure semantics"; and a second feature determination module, configured to determine the second feature of the first store sample from these semantic labels. Note that "first" and "second" in "first feature" and "second feature" only distinguish two different features and do not imply an order.
Further, the second feature determination module may be configured to: when the semantic labels corresponding to the first store sample include a label with closure semantics, determine that the second feature of the first store sample contains the semantics that the store is a non-real store. Here, "first" in "first store sample" and "first review information" means "a certain", "one of", or "any one", and indicates the correspondence between a store sample and its review information; it does not imply an order or a distinction among store samples.
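The rule for deriving the second feature from the per-review semantic labels can be sketched in a few lines. The label strings `"closed"` and `"not_closed"` and the 0/1 encoding of the feature are assumptions made for illustration.

```python
from typing import Iterable


def second_feature(semantic_labels: Iterable[str]) -> int:
    """Return 1 if any review carries closure semantics, otherwise 0.

    A value of 1 encodes "the reviews suggest the store is a non-real store".
    """
    return int(any(label == "closed" for label in semantic_labels))
```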
The semantic label determination module may further be configured to: for first review data in the first review information, represent each word in the first review data as a word vector by means of an unsupervised word vector model; determine, based on the word vectors, a first review vector corresponding to the first review data; input the first review vector into the semantic model to obtain an output result of the semantic model; and add a semantic label to the first review data according to the output result.
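The four steps above can be sketched as follows. The specification does not state how the word vectors are combined or what form the semantic model takes, so averaging the word vectors and scoring with a linear model are assumptions made purely for illustration; `word_vectors`, `weights`, and the label strings are hypothetical.

```python
from typing import Dict, List, Sequence


def review_vector(tokens: Sequence[str],
                  word_vectors: Dict[str, List[float]],
                  dim: int) -> List[float]:
    """Average the vectors of known words (one assumed way to pool them)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def semantic_label(vec: Sequence[float], weights: Sequence[float],
                   bias: float = 0.0, threshold: float = 0.0) -> str:
    """Stand-in for the pre-trained semantic model: a linear scorer."""
    score = sum(w * x for w, x in zip(weights, vec)) + bias
    return "closed" if score > threshold else "not_closed"
```

In practice the word vectors would come from an unsupervised model trained on a review corpus, and the semantic model would be trained on labeled reviews; the toy scorer here only shows how the pieces connect.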
In a possible implementation, the above features may further include, but are not limited to, at least one of the following: a review count feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
According to one possible design, the store samples include training samples and test samples, and the training unit 63 may include: a training module, configured to take the features of each training sample as input and adjust the classification parameters of the classification model according to a comparison between the output of the classification model and the classification label, so as to train the classification model; a test module, configured to input the features of each test sample into the classification model trained on the training samples, and check the accuracy of each output of the classification model against the classification label of the corresponding test sample, so as to obtain a test result for the classification model; and an adjustment module, configured to adjust the classification model according to the test result when the test result does not satisfy a preset condition, for example by adjusting the grid structure of the classification model or replacing the classification model. The preset condition here may be a condition on evaluation parameters of the classification model. For example, when the classification model is a gradient boosting decision tree (GBDT) model, the evaluation parameters may include the area under the curve (AUC), precision, recall, F1 score, and so on.
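The test step and the preset-condition check can be illustrated with a small evaluation helper. Treating the real-store label (1) as the positive class and using a minimum F1 score of 0.8 as the preset condition are assumptions; the specification leaves both the metric and the threshold open.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def meets_preset_condition(metrics: Dict[str, float],
                           min_f1: float = 0.8) -> bool:
    """Hypothetical preset condition: F1 on the test samples is high enough."""
    return metrics["f1"] >= min_f1
```

When `meets_preset_condition` returns `False`, the adjustment module would change the model (for example its hyperparameters or the model type) and the train-and-test cycle would repeat.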
With the above apparatus, Internet data can be fully utilized to train a classification model with high accuracy, thereby improving the effectiveness of store classification.
It is worth noting that the apparatus 600 shown in FIG. 6 corresponds to the method shown in FIG. 2; therefore, the description given for FIG. 2 also applies to the apparatus 600 and is not repeated here.
According to an embodiment of yet another aspect, an apparatus for store classification is also provided. FIG. 7 shows a schematic block diagram of an apparatus for store classification according to one embodiment. As shown in FIG. 7, the apparatus 700 for store classification includes: an acquisition unit 71, configured to acquire store information of a store to be classified, where the store information includes review information; an extraction unit 72, configured to extract features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained at least based on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to store authenticity; a classification unit 73, configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determination unit 74, configured to determine, according to the output result, whether the store to be classified is currently a real store.
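The flow through units 72 to 74 can be sketched as a short pipeline. The three-element feature layout, the interpretation of the model output as a score for "real store", the 0.5 threshold, and `toy_model` itself are all assumptions standing in for the trained classification model.

```python
from typing import Callable, List, Sequence


def build_feature_vector(days_since_latest: int, recent_review_count: int,
                         has_closure_review: bool) -> List[float]:
    """Pack the first feature (time-related) and second feature (semantic)."""
    return [float(days_since_latest), float(recent_review_count),
            1.0 if has_closure_review else 0.0]


def is_real_store(features: Sequence[float],
                  model: Callable[[Sequence[float]], float],
                  threshold: float = 0.5) -> bool:
    """Feed the features to the model and threshold its output score."""
    return model(features) >= threshold


def toy_model(features: Sequence[float]) -> float:
    """Placeholder for the trained classifier: low score if a closure review exists."""
    return 0.1 if features[2] == 1.0 else 0.9
```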
In one possible design, the first feature may include one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increase in the number of reviews within a predetermined time period.
According to one implementation, the second feature may be extracted as follows: acquiring first review information of a first store sample; using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where a semantic label indicates either closure semantics or no closure semantics; and determining the second feature of the first store sample according to the semantic labels. Further, in one embodiment, when the semantic labels corresponding to the first store sample include a label with closure semantics, the second feature of the first store sample is determined to contain the semantics that the store is a non-real store.
In one possible embodiment, using the pre-trained semantic model to determine the semantic label of each piece of review data in the review information includes: for first review data in the first review information, representing each word in the first review data as a word vector by means of an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.
In one embodiment, the above features may further include at least one of the following: a review count feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
With the above apparatus, Internet data can be fully utilized to extract effective classification features, thereby improving the effectiveness of store classification.
It is worth noting that the apparatus 700 shown in FIG. 7 corresponds to the method shown in FIG. 5; therefore, the description given for FIG. 5 also applies to the apparatus 700 and is not repeated here.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described with reference to FIG. 2 or FIG. 5.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method described with reference to FIG. 2 or FIG. 5.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (23)

  1. A method for training a classification model, the classification model being used to determine whether a store is currently a real store, the method comprising:
    selecting a predetermined number of store samples, the store samples having corresponding store information and classification labels, the classification labels including a real-store label and a non-real-store label, and the store information including review information;
    extracting features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature being obtained at least based on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to store authenticity;
    training the classification model based on the features and the classification labels of the store samples.
  2. The method according to claim 1, wherein selecting a predetermined number of store samples comprises:
    selecting, as positive samples, stores that exhibit at least one of the following behaviors within a predetermined period: selling vouchers, group-purchase activities, promotional activities, reservation services, Q&A interactions, advertisement placement, and receiving customer check-ins on the client, wherein the positive samples correspond to the real-store label.
  3. The method according to claim 1, wherein selecting a predetermined number of store samples comprises:
    selecting, as negative samples, stores that meet the following condition: being marked as permanently closed on an electronic map, wherein the negative samples correspond to the non-real-store label.
  4. The method according to claim 1, wherein the first feature includes one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increase in the number of reviews within a predetermined time period.
  5. The method according to claim 1, wherein the second feature is extracted by:
    acquiring first review information corresponding to a first store sample;
    using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, wherein a semantic label indicates either closure semantics or no closure semantics;
    determining the second feature of the first store sample according to the semantic labels.
  6. The method according to claim 5, wherein determining the second feature of the first store sample according to the semantic labels comprises:
    when the semantic labels include a label with closure semantics, determining that the second feature of the first store sample contains the semantics that the store is a non-real store.
  7. The method according to claim 5, wherein the semantic model includes a supervised model trained on a labeled review data set.
  8. The method according to claim 5, wherein using the pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information comprises:
    for first review data in the first review information, representing each word in the first review data as a word vector by means of an unsupervised word vector model;
    determining, based on the word vectors, a first review vector corresponding to the first review data;
    inputting the first review vector into the semantic model to obtain an output result of the semantic model;
    adding a semantic label to the first review data according to the output result.
  9. The method according to claim 1, wherein the features further include at least one of the following: a review count feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
  10. The method according to claim 1, wherein the store samples further include test samples, and
    the method further comprises:
    checking the accuracy of each output of the classification model for each test sample, so as to obtain a test result for the classification model according to the accuracy of the outputs;
    adjusting the classification model according to the test result until the test result satisfies a preset condition.
  11. A method for store classification, using a classification model trained according to any one of claims 1-10 to determine whether a store is currently a real store, the method comprising:
    acquiring store information of a store to be classified, wherein the store information includes review information;
    extracting features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature being obtained at least based on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to store authenticity;
    inputting the features of the store to be classified into the classification model to obtain an output result of the classification model;
    determining, according to the output result, whether the store to be classified is currently a real store.
  12. An apparatus for training a classification model, the classification model being used to determine whether a store is currently a real store, the apparatus comprising:
    a selection unit, configured to select a predetermined number of store samples, the store samples having corresponding store information and classification labels, the classification labels including a real-store label and a non-real-store label, and the store information including review information;
    an extraction unit, configured to extract features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature being obtained at least based on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to store authenticity;
    a training unit, configured to train the classification model based on the features and the classification labels of the store samples.
  13. The apparatus according to claim 12, wherein the selection unit is configured to:
    select, as positive samples, stores that exhibit at least one of the following behaviors within a predetermined period: selling vouchers, group-purchase activities, promotional activities, reservation services, Q&A interactions, advertisement placement, and receiving customer check-ins on the client, wherein the positive samples correspond to the real-store label.
  14. The apparatus according to claim 12, wherein the selection unit is further configured to:
    select, as negative samples, stores that meet the following condition: being marked as permanently closed on an electronic map, wherein the negative samples correspond to the non-real-store label.
  15. The apparatus according to claim 12, wherein the first feature includes one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increase in the number of reviews within a predetermined time period.
  16. The apparatus according to claim 12, wherein, for extracting the second feature, the extraction unit further comprises:
    a review information acquisition module, configured to acquire first review information corresponding to a first store sample;
    a semantic label determination module, configured to use a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, wherein a semantic label indicates either closure semantics or no closure semantics;
    a second feature determination module, configured to determine the second feature of the first store sample according to the semantic labels.
  17. The apparatus according to claim 16, wherein the second feature determination module is further configured to:
    when the semantic labels include a label with closure semantics, determine that the second feature of the first store sample contains the semantics that the store is a non-real store.
  18. The apparatus according to claim 16, wherein the semantic label determination module is further configured to:
    for first review data in the first review information, represent each word in the first review data as a word vector by means of an unsupervised word vector model;
    determine, based on the word vectors, a first review vector corresponding to the first review data;
    input the first review vector into the semantic model to obtain an output result of the semantic model;
    add a semantic label to the first review data according to the output result.
  19. The apparatus according to claim 12, wherein the features further include at least one of the following: a review count feature, a basic information completeness feature, a predetermined identifier feature, a store business category feature, and a consumer rating feature.
  20. The apparatus according to claim 12, wherein the store samples further include test samples, and
    the apparatus further comprises:
    a test module, configured to check the accuracy of each output of the classification model for each test sample, so as to obtain a test result for the classification model according to the accuracy of the outputs;
    an adjustment module, configured to adjust the classification model according to the test result when the test result does not satisfy a preset condition.
  21. An apparatus for store classification, using a classification model trained by the training apparatus according to any one of claims 12-20 to determine whether a store is currently a real store, the apparatus comprising:
    an acquisition unit, configured to acquire store information of a store to be classified, wherein the store information includes review information;
    an extraction unit, configured to extract features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature being obtained at least based on time-related attributes of the review information, and the second feature being determined based on semantic descriptions, contained in the review information, that relate to store authenticity;
    a classification unit, configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model;
    a determination unit, configured to determine, according to the output result, whether the store to be classified is currently a real store.
  22. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed in a computer, it causes the computer to perform the method according to any one of claims 1-10 or the method according to claim 11.
  23. A computing device, comprising a memory and a processor, characterized in that the memory stores executable code, and the processor, when executing the executable code, implements the method according to any one of claims 1-10 or the method according to claim 11.
PCT/CN2019/080022 2018-06-25 2019-03-28 Classification model training method and store classification method and device WO2020001106A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810662702.4A CN108985347A (en) 2018-06-25 2018-06-25 Training method, the method and device of shop classification of disaggregated model
CN201810662702.4 2018-06-25

Publications (1)

Publication Number Publication Date
WO2020001106A1 true WO2020001106A1 (en) 2020-01-02

Family

ID=64538738

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080022 WO2020001106A1 (en) 2018-06-25 2019-03-28 Classification model training method and store classification method and device

Country Status (3)

Country Link
CN (1) CN108985347A (en)
TW (1) TW202001736A (en)
WO (1) WO2020001106A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985347A (en) * 2018-06-25 2018-12-11 阿里巴巴集团控股有限公司 Training method, the method and device of shop classification of disaggregated model
CN109685555A (en) * 2018-12-13 2019-04-26 拉扎斯网络科技(上海)有限公司 Trade company's screening technique, device, electronic equipment and storage medium
CN109697637B (en) * 2018-12-27 2022-08-26 拉扎斯网络科技(上海)有限公司 Object type determination method and device, electronic equipment and computer storage medium
CN109840831A (en) * 2019-01-29 2019-06-04 浙江口碑网络技术有限公司 Page rendering method and device
CN109993545A (en) * 2019-02-01 2019-07-09 阿里巴巴集团控股有限公司 The verification method and apparatus of solid shop/brick and mortar store
CN110334306A (en) * 2019-06-21 2019-10-15 无线生活(北京)信息技术有限公司 Label processing method and device
CN111008331B (en) * 2019-11-29 2023-09-15 拉扎斯网络科技(上海)有限公司 Store-side display method and device, electronic equipment and storage medium
CN111368761B (en) * 2020-03-09 2022-12-16 腾讯科技(深圳)有限公司 Shop business state recognition method and device, readable storage medium and equipment
CN114339859B (en) * 2020-09-27 2023-08-15 中国移动通信集团广东有限公司 Method and device for identifying WiFi potential users of full-house wireless network and electronic equipment
CN114519114A (en) * 2020-11-20 2022-05-20 北京达佳互联信息技术有限公司 Multimedia resource classification model construction method and device, server and storage medium
CN113449169B (en) * 2021-09-01 2021-12-14 广州越创智数信息科技有限公司 Public opinion data acquisition method and system based on RPA

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108111A1 (en) * 2012-10-12 2014-04-17 Redpixtec. Gmbh Mobile advertising system
CN105095387A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Method and device for POI data collection based on user comment information
CN105808679A (en) * 2016-03-02 2016-07-27 陈健强 Electronic map based store operation state marking method and system
CN107092641A (en) * 2017-02-27 2017-08-25 口碑控股有限公司 Determination methods and device, the method and apparatus of shop search of shop business status
CN108985347A (en) * 2018-06-25 2018-12-11 阿里巴巴集团控股有限公司 Training method, the method and device of shop classification of disaggregated model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866542B (en) * 2015-05-05 2018-07-06 腾讯科技(深圳)有限公司 A kind of POI data verification method and device
CN108197177B (en) * 2017-12-21 2019-12-17 北京三快在线科技有限公司 Business object monitoring method and device, storage medium and computer equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625721A (en) * 2020-05-26 2020-09-04 汉海信息技术(上海)有限公司 Content recommendation method and device
CN111625721B (en) * 2020-05-26 2023-12-22 汉海信息技术(上海)有限公司 Content recommendation method and device
CN112328899A (en) * 2020-11-27 2021-02-05 京东数字科技控股股份有限公司 Information processing method, information processing apparatus, storage medium, and electronic device
CN112328899B (en) * 2020-11-27 2024-04-16 京东科技控股股份有限公司 Information processing method, information processing apparatus, storage medium, and electronic device
CN112561530A (en) * 2020-12-25 2021-03-26 民生科技有限责任公司 Transaction flow processing method and system based on multi-model fusion
CN115131068A (en) * 2022-07-08 2022-09-30 连连(杭州)信息技术有限公司 Shop classification method and device and computer storage medium
CN115131068B (en) * 2022-07-08 2023-12-26 连连(杭州)信息技术有限公司 Shop classification method, device and computer storage medium

Also Published As

Publication number Publication date
TW202001736A (en) 2020-01-01
CN108985347A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
WO2020001106A1 (en) Classification model training method and store classification method and device
CN108154401B (en) User portrait depicting method, device, medium and computing equipment
US20240062271A1 (en) Recommendations Based Upon Explicit User Similarity
US8600796B1 (en) System, method and computer program product for identifying products associated with polarized sentiments
US8818788B1 (en) System, method and computer program product for identifying words within collection of text applicable to specific sentiment
CN110135901A (en) A kind of enterprise customer draws a portrait construction method, system, medium and electronic equipment
WO2017190610A1 (en) Target user orientation method and device, and computer storage medium
CN109118316B (en) Method and device for identifying authenticity of online shop
US20160140627A1 (en) Generating high quality leads for marketing campaigns
US20180108029A1 (en) Detecting differing categorical features when comparing segments
CN112269805B (en) Data processing method, device, equipment and medium
CN106776897B (en) User portrait label determination method and device
CN109816134B (en) Method and device for predicting delivery address and storage medium
CN107832338B (en) Method and system for recognizing core product words
US20180285748A1 (en) Performance metric prediction for delivery of electronic media content items
Chen et al. Big data analytics on aviation social media: The case of China Southern Airlines on Sina Weibo
CN112925973A (en) Data processing method and device
CN116029637A (en) Cross-border electronic commerce logistics channel intelligent recommendation method and device, equipment and storage medium
KR20170028207A (en) Method and apparatus for analyzing pattern of consumption/interest
CN111091409A (en) Client tag determination method and device and server
Zhao et al. Online comments of multi-category commodities based on emotional tendency analysis
JP2020057206A (en) Information processing device
US11487835B2 (en) Information processing system, information processing method, and program
CN112015970A (en) Product recommendation method, related equipment and computer storage medium
CN110930103A (en) Service ticket checking method and system, medium and computer system

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19827203; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 19827203; Country of ref document: EP; Kind code of ref document: A1