WO2020001106A1

WO2020001106A1 - Classification model training method and store classification method and device

Info

Publication number: WO2020001106A1
Application number: PCT/CN2019/080022
Authority: WO
Inventors: 谢仁强; 马书超
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2018-06-25
Filing date: 2019-03-28
Publication date: 2020-01-02
Also published as: TW202001736A; CN108985347A

Abstract

A classification model training method and a store classification method and device. During the training of a classification model, store information corresponding to a selected store sample comprises review information. The store information is used to extract features of the store sample, the features comprising: a first feature obtained at least on the basis of a time-related attribute of the review information; and a second feature determined on the basis of a semantic description comprised in the review information and related to the existence or non-existence of the store. When the trained classification model is used to perform store classification, features extracted from stores to be classified also comprise the first feature and the second feature. In this way, internet data can be fully utilized to improve the effectiveness of store classification.

Description

Classification model training method, store classification method and device

Technical field

One or more embodiments of the present specification relate to the field of computer technology, and in particular to a method for training a classification model, and to a method and a device for store classification.

Background

With the development of computer and Internet technologies, people use more and more network platforms and applications in their daily lives, such as dating applications, shopping applications, ordering applications, map applications, and so on. For applications that can recommend stores (such as ordering applications and map applications), it is very important to describe the business status of the stores accurately (for example, whether they have closed). For example, a user who wants to eat Mala Tang searches for a nearby Mala Tang store on the map and walks there following the map, only to find that the store has closed, which results in a bad experience for the user.

Therefore, it is necessary to make full use of Internet data, extract effective training features, and train a classification model with high accuracy to determine which stores are closed, thereby improving the effectiveness of store classification.

Summary of the invention

One or more embodiments of the present specification describe a method and device that make full use of Internet data, extract effective training features, train a classification model with higher accuracy, and accurately determine which stores are closed when stores are classified, thereby improving the effectiveness of store classification.

According to a first aspect, a training method for a classification model is provided, where the classification model is used to determine whether a store is currently a real store. The method includes: selecting a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information; extracting features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; and training the classification model based on the features and the classification labels of each store sample.

In one embodiment, selecting a predetermined number of store samples includes: selecting, as a positive sample, a store that has shown at least one of the following behaviors within a predetermined period: selling vouchers, group purchases, promotions, reservation services, Q&A interactions, advertising, and receiving customer check-ins on the client, where the positive sample corresponds to the real store label.

In one embodiment, selecting a predetermined number of store samples includes: selecting, as a negative sample, a store that meets the following condition: it is marked as permanently closed on the electronic map, where the negative sample corresponds to the non-real store label.

In a possible embodiment, the first feature includes one or more of the following: the time of the latest review, the length of time between the latest review and the current time, and the increment in the number of reviews within a predetermined time period.

According to a possible design, the second feature is extracted by: obtaining first review information corresponding to a first store sample; using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where the semantic labels include closed semantics or non-closed semantics; and determining the second feature of the first store sample according to the semantic labels.

Further, in an implementation, determining the second feature of the first store sample according to the semantic labels includes: in a case where the semantic labels include a label with closed semantics, determining that the second feature of the first store sample is that it contains semantics indicating that the store is not a real store.

In one embodiment, the semantic model includes a supervised model trained on a labeled review dataset.

In a possible embodiment, using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information includes: for first review data in the first review information, representing each word in the first review data as a word vector through an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.

In one embodiment, the features further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.

According to a possible embodiment, the store samples further include test samples, and the method further includes: checking the accuracy of the output result of the classification model for each test sample, to obtain a test result for the classification model according to the accuracy of the output results; and adjusting the classification model according to the test result until the test result meets a preset condition.

According to a second aspect, a method for classifying a store is provided, which uses the classification model trained by any of the methods of the first aspect to determine whether a store is currently a real store. The method includes: obtaining store information of a store to be classified, where the store information includes review information; extracting features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; and determining, according to the output result, whether the store to be classified is currently a real store.

According to a third aspect, a training device for a classification model is provided, where the classification model is used to determine whether a store is currently a real store. The device includes: a selection unit configured to select a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information; an extraction unit configured to extract features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; and a training unit configured to train the classification model based on the features and the classification labels of each store sample.

According to a fourth aspect, a device for classifying a store is provided, which uses the classification model trained by the training device of the third aspect to determine whether a store is currently a real store. The device includes: an obtaining unit configured to obtain store information corresponding to a store to be classified, where the store information includes review information; an extraction unit configured to extract features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store; a classification unit configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determining unit configured to determine, according to the output result, whether the store to be classified is currently a real store.

According to a fifth aspect, there is provided a computer-readable storage medium having stored thereon a computer program, which when executed in a computer, causes the computer to execute the method of the first aspect or the second aspect.

According to a sixth aspect, there is provided a computing device including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.

With the method and device provided in the embodiments of the present specification, when the classification model is trained, the store information corresponding to the selected store samples includes review information, and the features extracted from the store information include a first feature obtained based on at least the time-related attributes of the review information and a second feature determined based on the semantic description, contained in the review information, related to the authenticity of the store. In this way, Internet data can be fully utilized to extract effective training features and train a classification model with higher accuracy. When the trained classification model is used to classify stores, the features extracted for the stores to be classified also include the above first and second features. In this way, Internet data can be fully utilized to improve the accuracy of store classification and thereby its effectiveness.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention. Those of ordinary skill in the art can also obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;

FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment;

FIG. 3 shows a specific example of the second feature extraction;

FIG. 4 shows a specific example of the model training process;

FIG. 5 shows a flowchart of a store classification method according to an embodiment;

FIG. 6 shows a schematic block diagram of a training device for a classification model according to an embodiment;

FIG. 7 shows a schematic block diagram of a store classification device according to an embodiment.

Detailed description

The solutions provided in this specification are described below with reference to the drawings.

FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. As shown in the figure, users can view store information through client applications, such as map applications, shopping applications, ordering applications, and so on. The client application can run on various terminal devices with data processing capabilities, such as smart phones, tablet computers, desktop computers, smart watches, and so on. The store information displayed in the client application is provided by the server. The server may be a processing device with a certain data processing capability, or a cluster of such devices. The computing platform trains a classification model, and the server uses the classification model to classify a store, determine whether it is a real store, and display the result to the user through the client application. Here, a "real store" is a store that actually exists and has not permanently closed or gone bankrupt; a short suspension of business (for example, two days) does not make a store non-real.

It is worth noting that the computing platform may be set in a server or a processing device independent of the server, which is not limited in this application. The classification model trained by the computing platform can be reused by the server. The results of the server's classification of the store through the classification model can also be reused.

The computing platform may first select a predetermined number of store samples, perform feature extraction on the store samples, and then train a classification model based on the extracted features and known classification results. Here, the store information corresponding to the selected store samples may include review information, so that during feature extraction the review information can be used to obtain a first feature based on at least the time-related attributes of the review information, and to determine a second feature based on the semantic description, contained in the review information, related to the authenticity of the store. In this way, it is possible to make full use of Internet data, extract effective training features, and train a classification model with higher accuracy.

The server uses the classification model trained by the computing platform to classify the stores to be classified. The server may first obtain the store information of a store to be classified, where the store information includes review information; then extract the features of the store based on the store information and input them into the classification model trained by the computing platform to obtain the output result of the classification model; and, according to the output result, determine whether the store is currently a real store. Correspondingly, the features that the server extracts for the store to be classified also include the above first feature and second feature extracted from the review information. In this way, it is possible to make full use of Internet data, extract effective features, improve the accuracy of store classification, and thereby make store classification results more effective.

When the user views store information through a client application, such as a map application, shopping application, or ordering application, the store information sent by the server to the client may include only the store information of non-closed stores, or the store information of all stores. When it includes the store information of all stores, it may also indicate whether each store is closed.

It is worth noting that FIG. 1 only shows one specific implementation scenario of an embodiment disclosed in this specification and does not limit the scope of the implementation scenarios of the embodiments of this specification. For example, another implementation scenario may differ as to whether the client in FIG. 1 is included, and so on.

The specific execution process of the above scenario is described below.

FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment. The execution subject of the method may be a system, equipment, device, platform or server with certain computing and data processing capabilities, such as the computing platform shown in FIG. 1. The classification model involved in this method can be used to determine whether the store is currently a real store.

As shown in FIG. 2, the method includes the following steps. Step 21: select a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include a real store label and a non-real store label, and the store information includes review information. Step 22: extract the features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store. Step 23: train the classification model based on the features and classification labels of each store sample.

First, in step 21, a predetermined number of store samples are selected, and the store samples correspond to store information and classification labels. Here, the classification labels include a real store label and a non-real store label. User reviews usually reflect users' direct, first-hand experience of a store, and real stores and non-real stores differ noticeably in their reviews; for example, a non-real store may have no reviews or only a few. The review information may therefore have a large influence on how a store is classified, so the store information corresponding to a store sample may include at least review information. The review information may include review content, review time, number of reviews, and so on.

In one embodiment, the store information can be crawled from a predetermined website (for example, XX reviews) by a web crawler (for example, one written in Python). For example, user registration information or content publishing information on the predetermined website can be crawled. The store information can then be obtained from the type of registered user (such as store or consumer) in the user registration information, the type of published content (such as sale or purchase) in the content publishing information, and the like. If the published content is sale information, the user who posted it may be a store, from which the store name, store location, and review information can be obtained. In practice, the electronic map can also be searched by store name, store location, and similar information to determine the classification label of the store. For example, a store that cannot be found on the electronic map is treated as a non-real store.

In another embodiment, store samples can also be collected manually offline, for example by manually checking store addresses on a website or map one by one to determine their classification labels. The store information of the corresponding store can also be obtained through at least one of telephone, search engine, registration information held by the administrative department, and so on. The review information in the store information can be obtained by, for example, a phone call or the "question and answer" content in a search engine.

In more embodiments, store samples with known classification labels may also be obtained through other acquisition channels, which are not described in detail here.

It can be understood that, for the obtained shop samples, a preliminary screening is needed, and a predetermined number of shop samples are selected from them. Store samples can include positive and negative samples. Among them, a positive sample may correspond to a real store label, and a negative sample may correspond to a non-real store label.

In a possible embodiment, a store that has shown at least one of the following behaviors within a predetermined period (such as one month) can be selected as a positive sample: selling vouchers, group purchase activities, promotional activities (such as discounts), reservation services, Q&A interaction, advertising, receiving customer check-ins on the client, and so on. In practice, stores may use various sales methods in their operations, such as selling vouchers, organizing group purchases, and running promotions. Some stores (such as hotels and restaurants) provide reservation services; some interact with consumers or potential consumers through Q&A on related websites (such as travel guide websites); and some cooperate with websites to place advertisements that increase page views or search rankings. In addition, some stores can receive customers' in-store check-ins through an application (such as a food review website): when a customer taps check-in on the store's page in the client, the check-in succeeds if the deviation between the check-in location and the store location is within a set distance (such as 80 meters). Generally, a store that offers check-in is likely to be a real store, and customers check in when they visit the store to make a purchase. Therefore, a store that has shown at least one of the above behaviors currently or within the predetermined period can be determined to be a positive sample, and these positive samples can be assigned the real store label.
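The check-in rule described above (a check-in succeeds when it is within a set distance of the store) can be sketched as follows. This is only an illustration: the function names, the use of the haversine great-circle formula, and the coordinate format are assumptions introduced here; the specification only gives the 80-meter example distance.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def check_in_succeeds(checkin, store, max_distance_m=80.0):
    """Hypothetical rule: accept the check-in if it lies within max_distance_m of the store."""
    return haversine_m(*checkin, *store) <= max_distance_m
```

A store with at least one successful check-in in the predetermined period would then qualify as a positive sample under this embodiment.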

In a possible embodiment, a store that meets the following condition may be selected as a negative sample: it is marked as permanently closed on the electronic map. In some map applications, when a store permanently closes, it is deleted from the map or marked as permanently closed. Therefore, stores can be searched for by store name and store location. For stores that the electronic map application marks as permanently closed, the electronic map is used to confirm that the store location is correct; these stores are then used as negative samples and assigned the non-real store label.

While the store samples are obtained, the store information corresponding to them can also be obtained. In addition to the review information, the store information may include, for example, a store name, a store address, and the like. In some embodiments, the store information may further include, but is not limited to, at least one of the following: basic store information, such as phone number, business hours, and whether a wireless network connection (such as Wi-Fi) is provided; the store brand name, such as ×× Steamed Bun Shop; store tags given by a website or an administrative supervision department, such as overseas food selections or local tourism bureau recommendations; and the store category, such as food, shopping, or hotels.

Understandably, non-real stores are stores that have permanently closed, and there are often fewer of them than real stores. According to a possible design, the obtained store samples with real store labels can be down-sampled so that the number of store samples with real store labels and the number with non-real store labels are approximately equal, for example, 45,000 each.
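The down-sampling described above can be sketched as follows. This is a minimal illustration under assumed conventions: samples are `(store_id, label)` pairs, label 1 stands for the real store label and 0 for the non-real store label, and the function name and fixed seed are hypothetical.

```python
import random

def balance_by_downsampling(samples, seed=0):
    """Down-sample real-store samples so both classes have about the same size.

    samples: list of (store_id, label), label 1 = real store, 0 = non-real store.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    if len(positives) > len(negatives):
        positives = rng.sample(positives, len(negatives))
    balanced = positives + negatives
    rng.shuffle(balanced)
    return balanced
```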

Next, in step 22, the features of the store samples are extracted based on the store information. In this embodiment, the features include at least a first feature and a second feature. The first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store. It is worth noting that "first" and "second" in "first feature" and "second feature" are only used to distinguish two different features and do not imply any order.

The time-related attributes of the review information may include, but are not limited to, at least one of the following: the time when a review was posted (such as May 1, 2018), the length of time between a review and the current time (such as 10 hours or 20 days), the number of reviews (such as 100) within a predetermined time period (such as 2 days), and so on. A real store is likely to keep attracting new consumers who make purchases and post reviews. Its latest review therefore tends to be recent, the time elapsed since that review tends to be short, and the number of reviews is more likely to grow within the predetermined time period. A non-real store, by contrast, has no new consumers, so its latest review is older, more time has elapsed since that review, and the number of reviews is less likely to grow within the predetermined time period.

Accordingly, the first feature may include, but is not limited to, one or more of the following: the time of the latest review, the length of time between the latest review and the current time, and the increment in the number of reviews within a predetermined time period. Here, the time of the latest review is the time of the review closest to the current time. For example, if the review information of a store sample contains a review posted at 20:00 on March 2, 2015 and no review after it, the time of the latest review is 20:00 on March 2, 2015. The length of time between the latest review and the current time is the difference between the current time and the time of the latest review, such as 30 days. The increment in the number of reviews within a predetermined time period is the change in the total number of reviews over each such period. For example, suppose the predetermined time period is 3 months. Based on the review times, the total number of reviews is counted for every 3 months back from the current time and the increment is calculated; if 1000 reviews were posted in the last 3 months, the review increment for the most recent 3 months is 1000. In this way, it is possible to make full use of the time-related attribute data in the review information of store samples on the Internet.

The semantic description related to the authenticity of the store contained in the review information may be a semantic description indicating that the store has closed or that it is operating normally. For example, "the store has closed down and no longer exists" may be a semantic description indicating that the store has permanently closed. The same review sentence may also carry different meanings depending on other information such as when it was posted. For example, for a restaurant, the review "came a long way and found it closed" may mean, if posted at 12 midnight, that the restaurant was merely outside business hours, whereas if posted at 12 noon it may mean that the restaurant has shut down. For a store, even a very small number of reviews (such as 1) whose semantics express that the store has closed may indicate that the store has permanently closed. Therefore, the features may include a second feature that reflects whether the review information contains a semantic description of the store being permanently closed.
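The three time-related quantities listed above can be computed from a list of review timestamps, as in the following sketch. The function name, the dictionary keys, and the 90-day default period are assumptions made for illustration; the specification does not prescribe a data layout.

```python
from datetime import datetime, timedelta

def first_feature(review_times, now, period=timedelta(days=90)):
    """Time-related features from review timestamps (hypothetical layout).

    review_times: list of datetime objects, one per review.
    now: the current time used as the reference point.
    period: the predetermined time period (about 3 months by default).
    """
    if not review_times:
        return {"latest_review_time": None, "days_since_latest": None,
                "increment_in_period": 0}
    latest = max(review_times)
    return {
        "latest_review_time": latest,
        "days_since_latest": (now - latest).days,
        # reviews posted within the most recent period count as the increment
        "increment_in_period": sum(1 for t in review_times if now - period < t <= now),
    }
```

A store whose `days_since_latest` is large and whose `increment_in_period` is zero would, on the reasoning above, be more likely to be a non-real store, although the classification itself is left to the trained model.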

The second feature may be expressed in words, for example, having a semantic description of the store permanently closed or including a semantic description related to the authenticity of the store, not having a semantic description of the store permanently closed or not including a semantic description related to the authenticity of the store, and so on. The second feature may also be represented by a numerical value, for example, the second feature is 1 in the case of having a semantic description of the store permanently closed, the second feature is 0 in the case of having no semantic description of the store permanently closed, and so on.

As shown in FIG. 3, according to a possible design, the second feature can be extracted as follows. Step 31: obtain the first review information corresponding to the first store sample. Step 32: use a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, where the semantic labels include closed semantics or non-closed semantics. Step 33: determine the second feature of the first store sample according to the semantic labels. It is worth noting that "first" in "first store sample" and "first review information" means "a certain", "one of", or "any one", and expresses the correspondence between a store sample and its review information; it does not indicate an order or distinguish between store samples.
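Step 33 can be sketched as a simple aggregation over the per-review semantic labels. The sketch assumes the implementation in which one review with closed semantics is enough to set the feature, and the numeric encoding in which 1 means the review information contains a description of the store being permanently closed; the label strings and function name are hypothetical.

```python
CLOSED = "closed"          # hypothetical label for closed semantics
NOT_CLOSED = "not_closed"  # hypothetical label for non-closed semantics

def second_feature(semantic_labels):
    """Return 1 if any review carries closed semantics, otherwise 0."""
    return 1 if any(label == CLOSED for label in semantic_labels) else 0
```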

For any one store sample, in step 31, the review information of the store sample may be obtained first. The review information of a shop sample may correspond to one or more pieces of review data. Each review data may include a review content, a review time, and data such as a user ID who posted the review.

Next, in step 32, a pre-trained semantic model is used to determine the semantic label corresponding to each piece of review data in the review information. Understandably, each piece of comment data can correspond to a semantic tag. Each piece of comment data can be input into a pre-trained semantic model, and the semantic label of a piece of comment data can be determined according to the output of the semantic model. Among them, the semantic model can be trained through a pre-annotated comment set.

As an example, some reviews can be selected from the review data of multiple store samples and added to the review set, in particular review data containing review sentences such as "closed". The semantic labels of these review data are determined through manual identification and labeling and are then used as known semantic labels to train a supervised model, such as a logistic regression (LR) model. Model training is the process of determining model parameters from known inputs (such as review sentences) and outputs (such as known semantic labels), and is not described further here. The semantic label of a piece of review data may indicate that it has closed semantics or that it does not.
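As a rough illustration of training such a supervised model, the following sketch fits a logistic regression by batch gradient descent on already-vectorized reviews. It is not the specification's implementation: the function names, learning rate, epoch count, and the assumption that each review has already been turned into a fixed-length numeric vector are all introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(vectors, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    vectors: list of equal-length feature lists (one per labeled review).
    labels: list of 1 (closed semantics) or 0 (non-closed semantics).
    """
    dim, n = len(vectors[0]), len(vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_proba(weights, bias, x):
    """Probability that the review vector x carries closed semantics."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```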

The output of the semantic model can be one of the semantic labels directly, or it can be a numerical value. In one case, the output is one of two possible values (such as 1 and 0), each of which corresponds to a semantic label; for example, 1 corresponds to the closed-semantics label. In another case, the output is one of many possible values (such as any decimal between 0 and 1), and a threshold can be set to decide which semantic label the output value leans toward; for example, an output greater than 0.6 is taken to indicate the closed-semantics label.
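The threshold rule can be written in a few lines. The label strings and the function name below are hypothetical; the 0.6 default follows the example value in the text.

```python
def label_from_output(score, threshold=0.6):
    """Map a semantic-model output in [0, 1] to a semantic label."""
    return "closed" if score > threshold else "not_closed"
```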

According to one embodiment, for each piece of review data in the review information, each word in the review data may first be expressed as a respective word vector through an unsupervised word vector model (such as the word2vec model); a comment vector corresponding to the review data is then determined based on the word vectors; the comment vector is input into the semantic model to obtain an output result of the semantic model; and a semantic label is added to the review data according to the output result. When the comment vector is determined based on the word vectors, it may be, for example, the average of the word vectors in each dimension, or a weighted average in each dimension.

For example, for the review data "the store has closed the door no longer exists", word segmentation and word filtering can first be applied to obtain the words "the store", "close the door", and "not exist". Assuming the word vector model has three dimensions a, b, and c, each word is represented as a word vector:

In one implementation, the comment vector corresponding to the comment data determined based on each word vector may be:

In another implementation, the number of occurrences of each word can be used as a weight, and the comment vector obtained as a weighted average over the dimensions of the word vectors is:

The 1 in front of each parameter is the number of occurrences of the corresponding word, and the denominator is the sum of the occurrence counts of all words. In this example each word occurs once; in practice the counts can take other values.
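The formulas of this example are not reproduced in the text. As a hedged reconstruction, assume (the subscripted notation is introduced here for illustration only) that the three words have the vectors $(a_1, b_1, c_1)$, $(a_2, b_2, c_2)$ and $(a_3, b_3, c_3)$. The plain average and the occurrence-weighted average described above would then read:

```latex
\bar{v} = \left( \frac{a_1 + a_2 + a_3}{3},\ \frac{b_1 + b_2 + b_3}{3},\ \frac{c_1 + c_2 + c_3}{3} \right)

\bar{v}_w = \left(
  \frac{1 \cdot a_1 + 1 \cdot a_2 + 1 \cdot a_3}{1 + 1 + 1},\
  \frac{1 \cdot b_1 + 1 \cdot b_2 + 1 \cdot b_3}{1 + 1 + 1},\
  \frac{1 \cdot c_1 + 1 \cdot c_2 + 1 \cdot c_3}{1 + 1 + 1}
\right)
```

Because every word occurs once here, the two vectors coincide; they differ only when some word is repeated.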

Further, the comment vector can be input to the semantic model, so as to obtain the output of the semantic model. Understandably, the comment vector can also be expressed as

Each of its components is input as a feature into the semantic model. A semantic tag can then be added to the comment data according to the output of the semantic model. For example, if the output of the semantic model is 1, the semantic tag "with closing semantics" is added to the comment data.
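The averaging of word vectors into a comment vector can be sketched as below. The numeric vectors are hypothetical placeholders for what a word vector model such as word2vec would return, and the function name and the convention of averaging over distinct words are assumptions made for this sketch.

```python
from collections import Counter


def comment_vector(words, word_vectors, weighted=False):
    """Average the word vectors of a review, dimension by dimension.

    words: list of words after segmentation and filtering.
    word_vectors: dict mapping each word to its vector (hypothetical values,
        standing in for the output of a word vector model).
    weighted: if True, weight each distinct word by its occurrence count.
    """
    counts = Counter(words)
    weights = {w: (n if weighted else 1) for w, n in counts.items()}
    total = sum(weights.values())
    dims = len(word_vectors[words[0]])
    return [
        sum(weights[w] * word_vectors[w][d] for w in weights) / total
        for d in range(dims)
    ]


# Hypothetical three-dimensional vectors (dimensions a, b, c).
vectors = {
    "the store": [0.2, 0.4, 0.6],
    "close the door": [0.8, 0.1, 0.3],
    "not exist": [0.5, 0.7, 0.0],
}
words = ["the store", "close the door", "not exist"]
plain = comment_vector(words, vectors)
weighted = comment_vector(words, vectors, weighted=True)
```

Since each word occurs once in this review, `plain` and `weighted` are the same vector. Each component of the result would be passed to the semantic model as one input feature.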

In this way, a semantic tag can be added to each piece of review data in the review information of a shop sample.

Step 33: determine the second feature of the corresponding store sample according to the semantic tags corresponding to the store sample. The second feature may be expressed in words, for example "contains a semantic description of the store being permanently closed" or "contains no semantic description of the store being permanently closed", or as a numerical value such as 1 or 0.

Further, in one embodiment, if any one of the semantic tags corresponding to the first store sample is a tag with closing semantics, the second feature of the store sample is determined to be that the review information includes the semantics that the store is a non-real store.

In some special cases, for example when a user vents emotions by posting a review such as "this store should have closed long ago", the review may also be tagged as having closing semantics. Therefore, in another embodiment, a number threshold may be set, and the second feature of the store sample is determined to include the semantics that the store is a non-real store only when the number of pieces of review data tagged with closing semantics exceeds the number threshold (such as 10).
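The two embodiments of step 33 (any closing tag, or more closing tags than a number threshold) can be sketched in one function. The function name, the tag string, and the 1/0 encoding are assumptions for illustration.

```python
CLOSING_TAG = "has closing semantics"


def second_feature(semantic_tags, count_threshold=None):
    """Return 1 if the reviews indicate a non-real store, else 0.

    semantic_tags: one tag per piece of review data.
    count_threshold: if None, a single closing tag is enough; otherwise the
        number of closing tags must exceed the threshold.
    """
    closing = sum(1 for tag in semantic_tags if tag == CLOSING_TAG)
    if count_threshold is None:
        return 1 if closing >= 1 else 0
    return 1 if closing > count_threshold else 0
```

With a threshold of 10, a store with a single emotional "should be closed" review keeps the value 0, which is the effect the threshold embodiment aims for.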

In this way, the semantic description data related to the authenticity of the store in the review information of store samples on the Internet can be fully utilized.

In one embodiment, in addition to the first feature and the second feature, the features of the store sample may also include review count features, such as the total number of reviews, the number of positive reviews, the number of negative reviews, the proportion of negative reviews, and the number of pictures in the reviews. A store with a large proportion of negative reviews is more likely to be a non-real store, while a store with a large total number of reviews or many pictures in its reviews is more likely to be a real store. Therefore, the review count features can be used as a factor that influences whether a store is classified as a real store.
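As a small sketch of how such review count features might be computed: the record fields `sentiment` and `pictures` are hypothetical, since the specification does not say how positive and negative reviews are identified or stored.

```python
def review_count_features(reviews):
    """Compute review count features from a list of review records.

    Each record is a dict with a hypothetical "sentiment" field
    ("positive" or "negative") and an optional "pictures" count.
    """
    total = len(reviews)
    positive = sum(1 for r in reviews if r.get("sentiment") == "positive")
    negative = sum(1 for r in reviews if r.get("sentiment") == "negative")
    pictures = sum(r.get("pictures", 0) for r in reviews)
    return {
        "total": total,
        "positive": positive,
        "negative": negative,
        # Guard against division by zero for stores without reviews.
        "negative_ratio": negative / total if total else 0.0,
        "pictures": pictures,
    }
```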

In one embodiment, the features of the store sample may further include a basic information completeness feature. Basic information includes, for example, the telephone number, business hours, whether a wireless network connection (such as Wi-Fi) is available, and service facilities. The more complete the basic information is, the more likely it is that the store exists. Optionally, the basic information completeness may be proportional to the number of basic information items provided. Therefore, the basic information completeness feature can be used as a factor that influences whether the store is classified as a real store.
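One way to make completeness proportional to the number of items provided is the fraction of filled fields. The field names below are assumptions chosen to mirror the examples in the text.

```python
# Hypothetical field names for the basic information items mentioned above.
BASIC_FIELDS = ("telephone", "business_hours", "wifi", "service_facilities")


def basic_info_completeness(info):
    """Fraction of basic information items that are filled in.

    A value of False (for example "no Wi-Fi") still counts as provided;
    only missing, None, or empty-string entries count as absent.
    """
    filled = sum(1 for field in BASIC_FIELDS if info.get(field) not in (None, ""))
    return filled / len(BASIC_FIELDS)
```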

In one embodiment, the features of the store sample may further include a predetermined identifier feature. The predetermined identifier may be, for example, a brand store or chain store identifier, or a preferred label given by a website or an administrative agency (such as a recommendation label from a local tourism bureau). Brand stores and chain stores are usually stores with high visibility and market recognition, and are therefore more likely to be real stores. Websites and administrative agencies give preferred labels to stores that have passed audits and inspections, and these stores are also more likely to be real stores. Therefore, the predetermined identifier feature can be used as a factor that influences whether the store is classified as a real store.

In one embodiment, the features of the store sample may further include a store business category feature. The business category may be, for example, food, hotel, or clothing. On some websites, food stores receive many reviews, so classifying by the number of reviews alone gives low accuracy. Stores in different business categories can therefore be treated differently, with greater weight given to stores in business categories that receive fewer reviews.

In one embodiment, the features of the store sample may also include a consumer score feature. Consumer scores can be expressed either as points or as star ratings. If the store samples are obtained from the same website and the consumer scores follow the same standard, the consumer scores can be used directly as the consumer score feature. If the store samples are not obtained from the same website, and the scoring standards may therefore differ, the ratio of the consumer score to full marks can be used as the consumer score feature. Consumer scores affect the customer flow of a store, and a store with low customer flow is more likely to become a non-real store. Therefore, the consumer score feature can be used as a factor that influences whether the store is classified as currently being a real store.
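The normalization described above can be sketched as follows. The function name and the `same_source` flag are assumptions; the ratio to full marks is the approach stated in the text.

```python
def consumer_score_feature(score, full_marks, same_source=False):
    """Turn a consumer score into a feature value.

    same_source: True when all store samples come from one website with a
        single scoring standard, so the raw score can be used directly.
    Otherwise the score is divided by the full-mark value of its scale.
    """
    if same_source:
        return float(score)
    if full_marks <= 0:
        raise ValueError("full_marks must be positive")
    return score / full_marks
```

For example, 4 stars out of 5 on one site and 8 points out of 10 on another both map to 0.8, which makes scores from different sources comparable.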

In more embodiments, the features of the store sample may also include more features, which will not be exemplified here.

In step 23, the classification model is trained based on the features and classification labels of each store sample. Model training is the process of determining model parameters based on known input features and known classification results. In this specification, the input is the set of features of a store sample, which comprises multiple input features, and the classification result is determined according to the classification label of the store sample. For example, the output result may be 0 or 1, where 0 is the real store label and 1 is the non-real store label. Each store sample corresponds to one set of known input features and one classification result.

As shown in FIG. 4, during the training of the classification model, the known input features fed into the input layer 42 are the features of each store sample, and the output of the output layer 43 can be compared with the classification label of the corresponding store sample. According to the comparison result, the parameters of the intermediate layer 44 are adjusted, as are the weight parameters represented by the arrows between the input layer 42 and the intermediate layer 44 and between the intermediate layer 44 and the output layer 43.

In FIG. 4, the known input features fed into the input layer 42 include a first feature 421 and a second feature 422, both of which are obtained from data related to the review information 411 in the store information 41.

In a possible design, store samples can be divided into training samples and test samples. During the training of the classification model, the features of each training sample are used as input in turn, and the classification parameters of the classification model are adjusted according to the comparison between the output of the classification model and the classification label, so that the output of the classification model becomes more consistent with the classification label of the currently input training sample. Next, the features of each test sample are input into the classification model trained on the training samples, and the classification labels corresponding to the test samples are used to check the accuracy of each output result of the classification model, yielding the detection result of the classification model. For example, if the classification label and the output of the classification model are consistent, the output of the classification model is determined to be correct. In this way, the detection results of the classification model over the entire set of test samples, such as accuracy, can be obtained.
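The split into training and test samples and the accuracy check can be sketched as below. The 80/20 split, the fixed seed, and the representation of a model as a plain callable are assumptions for this sketch; the specification does not fix any of them.

```python
import random


def split_samples(samples, test_fraction=0.2, seed=0):
    """Shuffle (features, label) pairs and split them into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_samples):
    """Share of test samples whose model output equals the classification label.

    model: any callable mapping a feature list to a label (hypothetical
        stand-in for the trained classification model).
    """
    correct = sum(1 for features, label in test_samples if model(features) == label)
    return correct / len(test_samples)
```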

In a case where the obtained detection result does not satisfy a preset condition, the classification model may be further adjusted according to the detection result, for example by adjusting the structure of the classification model or by changing to a different classification model. For example, when the classification model is a gradient boosted decision tree (GBDT) model, the number of trees, the depth of each tree, and the learning rate can be adjusted. After the classification model is adjusted, the training samples are used to train it again and the test samples are used to obtain its detection result, and this is repeated until the detection result satisfies the preset condition.
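The adjust, retrain, and retest loop can be sketched generically. The callables `train_and_evaluate` and `meets_condition` and the grid layout are hypothetical; the three tuned parameters are the ones named above for a GBDT model.

```python
from itertools import product


def tune(train_and_evaluate, meets_condition, grid):
    """Try parameter combinations until the detection result is acceptable.

    train_and_evaluate: hypothetical callable that trains a model with the
        given parameters and returns its detection result on the test samples.
    meets_condition: callable implementing the preset condition.
    grid: dict with candidate lists for "n_trees", "max_depth", "learning_rate".
    Returns (params, result) for the first acceptable combination,
    or (None, None) if no combination satisfies the condition.
    """
    for n_trees, depth, rate in product(
        grid["n_trees"], grid["max_depth"], grid["learning_rate"]
    ):
        params = {"n_trees": n_trees, "max_depth": depth, "learning_rate": rate}
        result = train_and_evaluate(params)
        if meets_condition(result):
            return params, result
    return None, None
```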

The preset condition here may be a condition set on the detection result of the classification model. For example, when the classification model is a gradient boosted decision tree (GBDT) model, the detection result may include values of the area under the curve (AUC), accuracy, recall, F1 score, and so on. For example, the preset condition may be that the accuracy and the recall are both greater than 0.7. In one experiment according to an embodiment of the present specification, AUC = 0.868, accuracy = 0.767, recall = 0.803, and F1 = 0.784.
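For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes these metrics from predictions and labels. Reading the reported 0.767 as a precision value is an assumption made here, not something the text states; under that reading, the harmonic mean with the recall of 0.803 is about 0.7846, in line with the reported F1 of 0.784.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predictions, labels, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Values reported in the text, with 0.767 treated as precision (assumption).
reported_f1 = f1_score(0.767, 0.803)
```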

Reviewing the above process, the store information corresponding to the selected store sample includes review information. Therefore, the features extracted from the store information may include at least a first feature obtained based on the time-related attributes of the review information, and a second feature determined based on the semantic description, contained in the review information, that relates to the authenticity of the store. Training a classification model on features that include the first feature and the second feature makes full use of Internet data and yields a classification model with higher accuracy, thereby improving the effectiveness of store classification.

According to an embodiment of another aspect, a method for classifying a store is also provided. It is used to determine whether the store is a real store through a classification model. This method is suitable for an electronic device with a certain data processing capability, such as the server in FIG. 1.

As shown in FIG. 5, the method for classifying a store in this embodiment includes the following steps: step 51, obtaining store information of a store to be classified, where the store information includes review information; step 52, extracting features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based on at least the time-related attributes of the review information, and the second feature being determined based on the semantic description related to the authenticity of the store included in the review information; step 53, inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; step 54, determining, according to the output result, whether the store to be classified is currently a real store.

First, in step 51, store information of a store to be classified is obtained. The store information includes at least review information, such as review content, review time, and number of reviews. The store information may also include, but is not limited to, at least one of the following: basic store information, the store brand name, a store label given by a website or an administrative supervision department, the store category, and so on. Store information can be crawled from a predetermined website (such as ×× comments, etc.) by a web crawler (for example, one written in Python).

Next, in step 52, the features of the store to be classified are extracted based on the store information. The features here correspond to the input features of the classification model. The feature includes at least a first feature and a second feature. The first feature is obtained based on at least the time-related attributes of the review information, and the second feature is determined based on the semantic description related to the authenticity of the store included in the review information. It is worth noting that the "first" and "second" in the "first feature" and "second feature" are only used to distinguish between two different features, and do not indicate a sequence limitation.

The time-related attributes of the review information may include, but are not limited to, at least one of the following: the review posting time, the time elapsed between a review and the current time, the number of reviews in a predetermined time period, and the like. Accordingly, the first feature may include, but is not limited to, one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews within a predetermined time period. In this way, the time-related attribute data of the review information of stores on the Internet can be fully utilized.
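The three example components of the first feature can be sketched with the standard `datetime` module. The function name, the dictionary keys, and the 30-day window are assumptions; the text only speaks of "a predetermined time period".

```python
from datetime import datetime, timedelta


def first_features(review_times, now, window_days=30):
    """Derive time-related features from a non-empty list of review timestamps.

    Returns the time of the latest review, the number of whole days between
    the latest review and `now`, and the number of reviews posted within the
    last `window_days` days (the increment over that window).
    """
    latest = max(review_times)
    window_start = now - timedelta(days=window_days)
    recent = sum(1 for t in review_times if window_start <= t <= now)
    return {
        "latest_review_time": latest,
        "days_since_latest": (now - latest).days,
        "reviews_in_window": recent,
    }
```

A store whose latest review is years old and whose window count is zero would look quite different under these features from one that received reviews last week.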

The semantic description related to the authenticity of the store contained in the review information may be a semantic description indicating that the store has closed or that it is in good business condition. For a store, even a very small number of reviews (such as one) expressing that the store has permanently closed may indicate that the store has in fact been permanently closed. Therefore, the store can be classified according to the second feature, that is, whether the review information contains a semantic description that the store is permanently closed. The second feature can be expressed in words or numerically.

According to a possible design, the second feature can be extracted by: obtaining the review information of the store to be classified; using a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the review information, wherein the semantic tag is either "has closing semantics" or "does not have closing semantics"; and determining the second feature of the store to be classified according to the semantic tags corresponding to the store to be classified.

It is easy to understand that the review information of a store to be classified may correspond to one or more pieces of review data. Each piece of review data may include the review content, the review time, and data such as the ID of the user who posted the review. Each piece of review data can be input into a pre-trained semantic model, and the semantic label of each piece of review data is determined based on the output of the semantic model. Then, the second feature of the store to be classified is determined according to these semantic tags. According to one embodiment, for each piece of review data in the review information, each word in the review data may first be expressed as a respective word vector through an unsupervised word vector model (such as the word2vec model); a comment vector corresponding to the review data is then determined based on the word vectors; the comment vector is input into the semantic model to obtain an output result of the semantic model; and a semantic label is added to the review data according to the output result.

In one embodiment, if any of the semantic tags corresponding to the store to be classified is a tag with closing semantics, the second feature of the store to be classified is determined to include the semantics that the store is a non-real store. In another embodiment, a number threshold may also be set, and the second feature of the store to be classified is determined to include the semantics that the store is a non-real store only when the number of pieces of review data tagged with closing semantics exceeds the number threshold.

In some possible designs, in addition to the first and second features, the features of the store to be classified may include, but are not limited to, at least one of the following: the review count feature, the basic information completeness feature, the predetermined identifier feature, the store business category feature, the consumer score feature, and so on.

Step 53: Input the characteristics of the store to be classified into a classification model to obtain an output result of the classification model. The output of the classification model can be a numerical value or a classification label. When the output of the classification model is a classification label, the classification label may include a real store label and a non-real store label.

As shown in FIG. 4, the features of the store to be classified extracted from the store information 41 are input to the input layer 42, where the features include the first feature 421 and the second feature 422 extracted from the review information 411. After passing through the intermediate layer 44, an output result is obtained from the output layer 43.

Step 54: determine, according to the output result, whether the store to be classified is currently a real store. When the output result is a classification label, whether the store to be classified is a real store is determined directly from the label: a store to be classified with a real store label is a real store, and otherwise it is a non-real store. When the output result is a numerical value that takes one of two values, for example only 1 or 0, the classification label of the store to be classified is the label corresponding to that value. If there are many possible values, for example any value between 0 and 1, the classification label of the store to be classified can be determined according to which end the value is biased toward, and this can be decided by a set threshold. For example, if the threshold for being biased toward 1 is set to 0.6, all values greater than 0.6 are treated as biased toward 1, which can correspond to the classification label of a non-real store.
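Step 54 can be sketched as a single function that accepts either kind of output. The label strings and the function name are assumptions; the convention that 1 corresponds to a non-real store and the 0.6 threshold follow the examples in the text.

```python
REAL = "real store"
NON_REAL = "non-real store"


def classify_store(output, threshold=0.6):
    """Turn a classification-model output into a classification label.

    output: either a label string (returned unchanged) or a number, where
        values biased toward 1 correspond to the non-real store label.
    """
    if isinstance(output, str):
        return output
    return NON_REAL if output > threshold else REAL
```

Binary outputs need no special case here: 1 lies above the threshold and 0 below it.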

It is worth noting that, in the method embodiment shown in FIG. 5, the store is classified by using a classification model trained as in the embodiment of FIG. 2. Therefore, the related description of the embodiment shown in FIG. 2 is also applicable to the corresponding content concerning the store to be classified in the embodiment shown in FIG. 5, and details are not described herein again.

According to an embodiment of another aspect, a training device for a classification model is also provided. FIG. 6 shows a schematic block diagram of a training apparatus for a classification model according to an embodiment. As shown in FIG. 6, the apparatus 600 for training a classification model includes: a selection unit 61 configured to select a predetermined number of store samples, where the store samples correspond to store information and classification labels, the classification labels include real store labels and non-real store labels, and the store information includes review information; an extraction unit 62 configured to extract features of the store sample based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based on at least the time-related attributes of the review information, and the second feature being determined based on the semantic description related to the authenticity of the store included in the review information; and a training unit 63 configured to train a classification model based on the features and classification labels of each store sample.

It can be understood that the store samples may include positive samples and negative samples, where a positive sample corresponds to a real store label and a negative sample corresponds to a non-real store label. Further, in one embodiment, the selection unit 61 may be configured to select, as positive samples, stores that have at least one of the following behaviors within a predetermined period: selling vouchers, group purchase activities, promotional activities, reservation services, Q&A interactions, advertisement placement, and receiving customer check-ins on the client. In another embodiment, the selection unit 61 may be further configured to select, as negative samples, stores that meet the following condition: the store is marked as permanently closed on the electronic map.
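The selection rules for positive and negative samples can be sketched as a labeling function. The record fields, the behavior identifiers, and the choice to check the "permanently closed" mark first are assumptions; the 0/1 encoding follows the example in which 0 is the real store label and 1 the non-real store label.

```python
# Hypothetical identifiers for the behaviors listed in the text.
POSITIVE_BEHAVIORS = {
    "voucher_sale", "group_purchase", "promotion", "reservation",
    "qa_interaction", "advertising", "customer_check_in",
}


def sample_label(store):
    """Return 0 (real store), 1 (non-real store), or None (not selected).

    store: dict with a hypothetical "marked_permanently_closed" flag and a
        "recent_behaviors" collection covering the predetermined period.
    """
    if store.get("marked_permanently_closed"):
        return 1  # negative sample
    if POSITIVE_BEHAVIORS & set(store.get("recent_behaviors", ())):
        return 0  # positive sample
    return None  # neither rule applies, so the store is not used as a sample
```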

According to an embodiment of one aspect, the first feature may include one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews in a predetermined time period.

According to an embodiment of another aspect, for extracting the second feature, the extraction unit 62 may further include: a review information acquisition module configured to acquire the first review information of the first store sample; a semantic label determination module configured to use a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information, wherein the semantic tag is either "has closing semantics" or "does not have closing semantics"; and a second feature determination module configured to determine the second feature of the first store sample according to each semantic tag. It is worth noting that the "first" and "second" in the "first feature" and "second feature" are only used to distinguish between two different features, and do not indicate a sequence limitation.

Further, the second feature determination module may be further configured to: in a case where the semantic tags corresponding to the first store sample include a tag with closing semantics, determine the second feature of the first store sample as including the semantics that the store is a non-real store. The "first" in the "first store sample" and "first review information" referred to here means "some", "one", or "any", and expresses the correspondence between the store sample and the review information; it does not indicate an order of, or a distinction between, store samples.

The semantic label determination module may be further configured to: for the first review data in the first review information, represent each word in the first review data as a respective word vector through an unsupervised word vector model; determine, based on the word vectors, a first review vector corresponding to the first review data; input the first review vector into the semantic model to obtain an output result of the semantic model; and add a semantic label to the first review data according to the output result.

In a possible implementation manner, the above-mentioned features may further include, but are not limited to, at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.

According to a possible design, the store samples include training samples and test samples, and the training unit 63 may include: a training module configured to take the features of each training sample as input in turn and, according to the comparison between the output result of the classification model and the classification label, adjust the classification parameters of the classification model so as to train the classification model; a test module configured to input the features of each test sample into the classification model trained on the training samples, and to use the classification label corresponding to the test sample to check the accuracy of each output result of the classification model, so as to obtain the detection result of the classification model; and an adjustment module configured to adjust the classification model according to the detection result if the detection result does not satisfy a preset condition, for example by adjusting the structure of the classification model or by changing to a different classification model. The preset condition here may be a condition on the evaluation parameters of the classification model. For example, when the classification model is a gradient boosted decision tree (GBDT) model, the model evaluation parameters may include the area under the curve (AUC), accuracy, recall, F1 score, and so on.

Through the above devices, it is possible to make full use of Internet data and train a classification model with higher accuracy, thereby improving the effectiveness of store classification.

It is worth noting that the apparatus 600 shown in FIG. 6 corresponds to the method shown in FIG. 2. Therefore, the related description in FIG. 2 is also applicable to the apparatus 600, and details are not described herein again.

According to an embodiment of still another aspect, a device for classifying a store is also provided. FIG. 7 shows a schematic block diagram of a store classification device according to one embodiment. As shown in FIG. 7, the apparatus 700 for classifying a store includes: an obtaining unit 71 configured to obtain store information of a store to be classified, wherein the store information includes review information; an extracting unit 72 configured to extract features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature being obtained based on at least the time-related attributes of the review information, and the second feature being determined based on the semantic description related to the authenticity of the store included in the review information; a classification unit 73 configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determination unit 74 configured to determine, based on the output result, whether the store to be classified is currently a real store.

In a possible design, the first feature may include one or more of the following: the time of the most recent comment, the length of time of the most recent comment from the current time, and the increment of the number of comments within a predetermined time period.

According to an embodiment, the second feature may be extracted by: obtaining first review information of a first store sample; using a pre-trained semantic model to determine a semantic label corresponding to each piece of review data in the first review information, where the semantic label is either "has closing semantics" or "does not have closing semantics"; and determining the second feature of the first store sample according to each semantic tag. Further, in one embodiment, in a case where the semantic tags corresponding to the first store sample include a tag with closing semantics, the second feature of the first store sample is determined to include the semantics that the store is a non-real store.

In a possible embodiment, using a pre-trained semantic model to determine the semantic label of each piece of review data in the review information includes: for the first review data in the first review information, using an unsupervised word vector model to represent each word in the first review data as a respective word vector; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.

In one embodiment, the above features may further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.

Through the above devices, the Internet data can be fully utilized to extract effective classification features, thereby improving the effectiveness of store classification.

It is worth noting that the apparatus 700 shown in FIG. 7 corresponds to the method shown in FIG. 5. Therefore, the related description in FIG. 5 is also applicable to the apparatus 700, and details are not described herein again.

According to another embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2 or FIG. 5.

According to an embodiment of still another aspect, a computing device is further provided, which includes a memory and a processor. The memory stores executable code, and when the processor executes the executable code, the method described in conjunction with FIG. 2 or FIG. 5 is implemented.

Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in or transmitted over as one or more instructions or code on a computer-readable medium.

The specific embodiments described above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made on the basis of the technical solution of the present invention shall fall within the scope of protection of the present invention.

Claims

A training method for a classification model, the classification model is used to determine whether a store is currently a real store, and the method includes:

Selecting a predetermined number of store samples, the store samples corresponding to store information and classification labels, the classification labels including real store labels and non-real store labels, and the store information including review information;

Extracting features of the store sample based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least a time-related attribute of the review information, and the second feature is determined based on the semantic description related to the authenticity of the store included in the review information;

Training the classification model based on the features and the classification labels of each store sample.
The method of claim 1, wherein selecting a predetermined number of store samples comprises:

Selecting, as a positive sample, a store that has exhibited at least one of the following behaviors within a predetermined period: selling vouchers, group purchases, promotions, reservation services, Q&A interaction, advertising, and receiving customer check-ins on the client, wherein the positive sample corresponds to the real store label.
The method of claim 1, wherein selecting a predetermined number of store samples comprises:

Selecting, as a negative sample, a store that meets the following condition: it is marked as permanently closed on an electronic map, wherein the negative sample corresponds to the non-real store label.
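As a non-limiting illustration, the positive and negative sample selection recited in the two preceding claims may be sketched as follows. The field names `recent_behaviors` and `permanently_closed_on_map`, the behavior identifiers, and the rule that the map flag takes precedence when both conditions hold are assumptions made for this example only and are not specified in the text.

```python
# Hypothetical identifiers for the behaviors listed in the claim.
ACTIVE_BEHAVIORS = {
    "voucher_sale", "group_purchase", "promotion", "reservation",
    "qa_interaction", "advertising", "check_in",
}

def label_store(store):
    """Return 1 (real store label), 0 (non-real store label),
    or None when the store qualifies as neither kind of sample."""
    # Assumption: a permanent-closure mark on the map takes precedence.
    if store.get("permanently_closed_on_map"):
        return 0
    # Positive sample: at least one listed behavior within the period.
    if ACTIVE_BEHAVIORS & set(store.get("recent_behaviors", ())):
        return 1
    return None
```

A store that matches neither condition is simply left out of the sample set in this sketch.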
The method according to claim 1, wherein the first feature comprises one or more of the following: the time of the latest comment, the length of time between the latest comment and the current time, and the increase in the number of comments within a predetermined time period.
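As a non-limiting illustration, the time-related quantities listed above may be computed from the timestamps of a store's comments roughly as follows. The function name `first_feature`, the 30-day default window, and the reading of "increase in the number of comments" as the count of comments posted inside the window are assumptions made for this example.

```python
from datetime import datetime, timedelta

def first_feature(comment_times, now, window_days=30):
    """Compute time-related review features for one store.

    comment_times: list of datetime objects, one per comment.
    now: the reference (current) datetime.
    """
    if not comment_times:
        return {"latest_comment_time": None,
                "days_since_latest": None,
                "comment_increase": 0}
    latest = max(comment_times)
    window_start = now - timedelta(days=window_days)
    # Count comments posted inside the predetermined time period.
    increase = sum(1 for t in comment_times if window_start <= t <= now)
    return {
        "latest_comment_time": latest,
        "days_since_latest": (now - latest).days,
        "comment_increase": increase,
    }
```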
The method according to claim 1, wherein the second feature is extracted by the following method:

Acquiring first review information corresponding to a first store sample;

Determining a semantic tag corresponding to each piece of review data in the first review information by using a pre-trained semantic model, wherein the semantic tag indicates either closed semantics or non-closed semantics;

The second feature of the first store sample is determined according to each semantic tag.
The method according to claim 5, wherein determining the second feature of the first store sample according to the respective semantic tags comprises:

In the case where the semantic tags include a tag with closed semantics, determining that the second feature of the first store sample includes the semantics that the store is a non-real store.
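As a non-limiting illustration, the aggregation of per-review semantic tags into the second feature may be sketched as follows. The tag strings and the 0/1 encoding of the feature are assumptions made for this example.

```python
CLOSED = "closed"          # hypothetical tag value for closed semantics
NOT_CLOSED = "not_closed"  # hypothetical tag value for non-closed semantics

def second_feature(semantic_tags):
    """Return 1 if any review carries closed semantics, otherwise 0.

    A value of 1 stands for "the reviews describe the store as a
    non-real store"; 0 means no such description was found.
    """
    return 1 if any(tag == CLOSED for tag in semantic_tags) else 0
```

Under this encoding a single review tagged as closed is enough to set the feature, which matches the condition stated above.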
The method of claim 5, wherein the semantic model comprises a supervised model trained on annotated comment data sets.
The method according to claim 5, wherein using the pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information comprises:

For first review data in the first review information, representing each word in the first review data as a word vector by means of an unsupervised word vector model;

Determining a first review vector corresponding to the first review data based on the respective word vectors;

Inputting the first comment vector into the semantic model to obtain an output result of the semantic model;

Adding a semantic tag to the first comment data according to the output result.
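As a non-limiting illustration, the four steps above may be sketched as follows. The text does not state how the review vector is derived from the word vectors, so averaging is assumed here. The lookup table `word_vectors` stands in for the unsupervised word vector model, and a linear scorer with hypothetical `weights` stands in for the pre-trained semantic model.

```python
def review_vector(words, word_vectors, dim):
    """Average the vectors of the known words in one review.

    Assumption: averaging is one possible way to combine word vectors.
    """
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def semantic_tag(vec, weights, bias=0.0, threshold=0.0):
    """Score the review vector with a linear model and map it to a tag."""
    score = sum(w * x for w, x in zip(weights, vec)) + bias
    return "closed" if score > threshold else "not_closed"
```

In practice the word vectors and the model weights would come from training; the values used to exercise this sketch are purely illustrative.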
The method according to claim 1, wherein the features further include at least one of the following: a number-of-comments feature, a basic information completeness feature, a predetermined identification feature, a store operation category feature, and a consumer scoring feature.
The method of claim 1, wherein the store sample further comprises a test sample, and

The method further includes:

Detecting the accuracy of each output result of the classification model for each test sample, so as to obtain the detection result of the classification model according to the accuracy of each output result;

Adjusting the classification model according to the detection result until the detection result meets a preset condition.
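As a non-limiting illustration, the detect-and-adjust procedure above may be sketched as follows. Here the detection result is taken to be the accuracy on the test samples, the preset condition is a minimum accuracy, and "adjusting" is reduced to trying candidate parameters in turn. The names `make_model`, `candidate_params`, and `min_accuracy` are hypothetical.

```python
def accuracy(predict, test_samples):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_samples
                  if predict(features) == label)
    return correct / len(test_samples)

def adjust_until_ok(make_model, candidate_params, test_samples, min_accuracy):
    """Try candidate parameters until the accuracy meets the preset condition.

    Returns (params, accuracy) for the best model seen so far.
    """
    best = (None, -1.0)
    for params in candidate_params:
        acc = accuracy(make_model(params), test_samples)
        if acc > best[1]:
            best = (params, acc)
        if acc >= min_accuracy:
            break  # preset condition met, stop adjusting
    return best
```

If no candidate meets the preset condition, this sketch returns the best candidate found rather than failing.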
A method for classifying a store, using the classification model trained in any one of claims 1-10 to determine whether the store is currently a real store, the method includes:

Acquiring store information of a store to be classified, where the store information includes review information;

Extracting features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least a time-related attribute of the review information, and the second feature is determined based on the semantic description related to the authenticity of the store included in the review information;

Inputting the features of the store to be classified into the classification model to obtain an output result of the classification model;

It is determined whether the store to be classified is currently a real store according to the output result.
A training device for a classification model. The classification model is used to determine whether a store is a real store. The device includes:

A selection unit configured to select a predetermined number of store samples, the store samples corresponding to store information and classification labels, the classification labels including real store labels and non-real store labels, and the store information including review information;

An extraction unit configured to extract features of the store sample based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least a time-related attribute of the review information, and the second feature is determined based on the semantic description related to the authenticity of the store included in the review information;

A training unit configured to train the classification model based on the features and the classification labels of each store sample.
The apparatus according to claim 12, wherein the selection unit is configured to:

Select, as a positive sample, a store that has exhibited at least one of the following behaviors within a predetermined period: selling vouchers, group purchases, promotions, reservation services, Q&A interaction, advertisement placement, and receiving customer check-ins on the client, wherein the positive sample corresponds to the real store label.
The apparatus according to claim 12, wherein the selection unit is further configured to:

Select, as a negative sample, a store that meets the following condition: it is marked as permanently closed on an electronic map, wherein the negative sample corresponds to the non-real store label.
The apparatus according to claim 12, wherein the first feature comprises one or more of the following: the time of the latest comment, the length of time between the latest comment and the current time, and the increase in the number of comments within a predetermined time period.
The apparatus according to claim 12, wherein when extracting the second feature, the extraction unit further comprises:

A review information acquisition module configured to acquire first review information corresponding to a first store sample;

A semantic tag determination module configured to determine a semantic tag corresponding to each piece of review data in the first review information by using a pre-trained semantic model, wherein the semantic tag indicates either closed semantics or non-closed semantics;

A second feature determination module is configured to determine a second feature of the first store sample according to each semantic tag.
The apparatus according to claim 16, wherein the second feature determination module is further configured to:

In the case where the semantic tags include a tag with closed semantics, determine that the second feature of the first store sample includes the semantics that the store is a non-real store.
The apparatus according to claim 16, wherein the semantic label determination module is further configured to:

For first review data in the first review information, represent each word in the first review data as a word vector by means of an unsupervised word vector model;

Determining a first review vector corresponding to the first review data based on the respective word vectors;

Inputting the first comment vector into the semantic model to obtain an output result of the semantic model;

Adding a semantic tag to the first comment data according to the output result.
The apparatus according to claim 12, wherein the features further include at least one of the following: a number-of-comments feature, a basic information completeness feature, a predetermined identification feature, a store operation category feature, and a consumer scoring feature.
The apparatus according to claim 12, wherein the store sample further includes a test sample, and

The device further includes:

A testing module configured to detect the accuracy of each output result of the classification model for each test sample, so as to obtain the detection result of the classification model according to the accuracy of each output result;

An adjustment module is configured to adjust the classification model according to the detection result when the detection result does not satisfy a preset condition.
A device for classifying a store, using the classification model trained by any of the training devices of claims 12-20, to determine whether the store is currently a real store, and the device includes:

An obtaining unit configured to obtain store information of a store to be classified, wherein the store information includes review information;

An extraction unit configured to extract features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based on at least a time-related attribute of the review information, and the second feature is determined based on a semantic description related to the authenticity of the store included in the review information;

A classification unit configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model;

A determining unit configured to determine, according to the output result, whether the store to be classified is currently a real store.
A computer-readable storage medium having stored thereon a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the method according to any one of claims 1 to 10, or the method of claim 11.
A computing device comprising a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method according to any one of claims 1 to 10, or the method of claim 11, is implemented.