WO2020001106A1 - Classification model training method, and store classification method and device - Google Patents

Classification model training method, and store classification method and device

Info

Publication number
WO2020001106A1
Authority
WO
WIPO (PCT)
Prior art keywords
store
feature
information
semantic
review
Prior art date
Application number
PCT/CN2019/080022
Other languages
English (en)
Chinese (zh)
Inventor
谢仁强
马书超
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020001106A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Definitions

  • One or more embodiments of the present specification relate to the field of computer technology, and in particular, to a training method of a computer classification model, a method and a device of store classification.
  • One or more embodiments of the present specification describe a method and device that make full use of Internet data: by extracting effective training features, a classification model with higher accuracy is trained, so that stores that have closed can be accurately identified when stores are classified, thereby improving the effectiveness of store classification.
  • a training method for a classification model is provided.
  • the classification model is used to determine whether a store is currently a real store, and the method includes: selecting a predetermined number of store samples, the store samples corresponding to store information and a classification label, the classification label including a real store label and a non-real store label, and the store information including review information; extracting features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on a semantic description, contained in the review information, related to the authenticity of the store; and training the classification model based on the features and the classification labels of each store sample.
  • selecting a predetermined number of store samples includes: selecting, as a positive sample, a store that exhibits at least one of the following behaviors within a predetermined period: selling vouchers, group purchases, promotions, reservation services, Q&A interactions, advertising, and receiving customer check-ins through the client, wherein the positive sample corresponds to a real store label.
  • selecting a predetermined number of store samples includes: selecting, as a negative sample, a store that meets the following condition: it is marked as permanently closed on an electronic map, wherein the negative sample corresponds to a non-real store label.
  • the first feature includes one or more of the following: the time of the latest comment, the length of time from the latest comment to the current time, and the increment in the number of comments within a predetermined time period.
  • the second feature is extracted by: obtaining first review information corresponding to a first store sample; using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information, wherein the semantic labels include closed semantics or non-closed semantics; and determining the second feature of the first store sample according to the semantic labels.
  • determining the second feature of the first store sample according to the semantic labels includes: in a case where the semantic labels include a label with closed semantics, determining that the second feature of the first store sample is that it contains the semantics that the store is not a real store.
  • the semantic model includes a supervised model trained on a labeled review dataset.
  • using a pre-trained semantic model to determine the semantic label corresponding to each piece of review data in the first review information includes: for first review data in the first review information, representing each word in the first review data as a respective word vector through an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic label to the first review data according to the output result.
  • the features further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • the store samples further include test samples, and the method further includes: checking the accuracy of each output result of the classification model for each test sample, to obtain a detection result of the classification model according to the accuracy of the output results; and adjusting the classification model according to the detection result until the detection result meets a preset condition.
  • a method for classifying a store is provided, which uses the classification model trained by any of the methods of the first aspect to determine whether a store is currently a real store. The method includes: obtaining store information of a store to be classified, wherein the store information includes review information; extracting features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on a semantic description, contained in the review information, related to the authenticity of the store; inputting the features of the store to be classified into the classification model to obtain an output result of the classification model; and determining, according to the output result, whether the store to be classified is currently a real store.
  • a training device for a classification model is provided.
  • the classification model is used to determine whether a store is currently a real store.
  • the device includes: a selection unit configured to select a predetermined number of store samples, the store samples corresponding to store information and classification labels, the classification labels including real store labels and non-real store labels, and the store information including review information; an extraction unit configured to extract features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on a semantic description, contained in the review information, related to the authenticity of the store; and a training unit configured to train the classification model based on the features and the classification labels of each store sample.
  • a device for classifying a store is provided.
  • the classification model trained by the training device of the third aspect is used to determine whether a store is currently a real store.
  • the device includes: an obtaining unit configured to obtain store information of a store to be classified, wherein the store information includes review information; an extraction unit configured to extract features of the store to be classified based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on a semantic description, contained in the review information, related to the authenticity of the store; a classification unit configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determining unit configured to determine, according to the output result, whether the store to be classified is currently a real store.
  • a computer-readable storage medium having stored thereon a computer program, which when executed in a computer, causes the computer to execute the method of the first aspect or the second aspect.
  • a computing device including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.
  • the store information corresponding to the selected store samples includes review information, and the features of the store samples extracted from the store information include the first feature, obtained based at least on time-related attributes of the review information, and the second feature, determined based on the semantic description, contained in the review information, related to the authenticity of the store.
  • the Internet data can be fully utilized to extract effective training features and train a classification model with higher accuracy.
  • the extracted features of the stores to be classified also include the above-mentioned first and second features. In this way, Internet data can be fully utilized to improve the accuracy of store classification, thereby improving the effectiveness of store classification.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
  • FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment
  • FIG. 3 shows a specific example of the second feature extraction
  • FIG. 4 shows a specific example of the model training process
  • FIG. 5 shows a flowchart of a store classification method according to an embodiment
  • FIG. 6 shows a schematic block diagram of a training device for a classification model according to an embodiment
  • FIG. 7 shows a schematic block diagram of a store classification device according to an embodiment.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
  • users can view store information through client applications, such as map applications, shopping applications, ordering applications, and so on.
  • the client application here can run on various terminal devices with data processing capabilities, such as smart phones, tablet computers, desktop computers, smart watches, and so on.
  • the store information displayed on the client application is provided through the server.
  • the server may be a processing device with a certain data processing capability, or a processing device cluster.
  • the computing platform trains a classification model, and the server uses the classification model to classify the store, determine whether the store is a real store, and display it to the user through a client application.
  • here, a real store is a store that actually exists and has not permanently closed or gone bankrupt; a short suspension of business (such as two days) does not count as closure.
  • the computing platform may be set in a server or a processing device independent of the server, which is not limited in this application.
  • the classification model trained by the computing platform can be reused by the server.
  • the results of the server's classification of the store through the classification model can also be reused.
  • the computing platform may first select a predetermined number of store samples, perform feature extraction on the store samples, and then train a classification model based on the extracted features and known classification results.
  • the store information corresponding to the selected store samples may include review information, so that when the features are extracted, the review information may be used to obtain the first feature based at least on time-related attributes of the review information, and to determine the second feature based on the semantic description, contained in the review information, related to the authenticity of the store. In this way, it is possible to make full use of Internet data, extract effective training features, and train a classification model with higher accuracy.
  • the server uses the classification model trained by the computing platform to classify the stores to be classified.
  • the server may first obtain the store information of the store to be classified, where the store information includes review information, then extract the features of the store to be classified based on the store information and input them into the classification model trained by the computing platform to obtain the output result of the classification model, and then determine, according to the output result, whether the store to be classified is currently a real store.
  • the features that the server extracts for the store to be classified also include the above-mentioned first feature and second feature extracted from the review information. In this way, it is possible to make full use of Internet data, extract effective features, improve the accuracy of store classification, and thereby make store classification results more effective.
  • the store information sent by the server to the client may include only store information of non-closed stores, or store information of all stores.
  • the store information sent by the server to the client may also include information on whether the store is closed.
  • FIG. 1 only shows a specific implementation scenario of an embodiment disclosed in this specification, but it does not limit the scope of the implementation scenarios of the embodiments of this specification. For example, in another implementation scenario, Including the client in Figure 1, and so on.
  • FIG. 2 shows a flowchart of a training method of a classification model according to an embodiment.
  • the execution subject of the method may be a system, equipment, device, platform or server with certain computing and data processing capabilities, such as the computing platform shown in FIG. 1.
  • the classification model involved in this method can be used to determine whether the store is currently a real store.
  • the method includes the following steps: Step 21: Select a predetermined number of store samples.
  • the store samples correspond to store information and classification labels.
  • the classification labels include real store labels and non-real store labels.
  • the store information includes review information. Step 22: extract the features of the store samples based on the store information, wherein the features include at least a first feature and a second feature, the first feature is obtained based at least on time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store. Step 23: train a classification model based on the features and classification labels of each store sample.
  • a predetermined number of store samples are selected, and the store samples correspond to store information and classification labels.
  • the classification label includes a real store label and a non-real store label. Understandably, user reviews usually reflect the user's direct, first-hand experience of the store, and the reviews of real stores differ markedly from those of non-real stores. For example, a non-real store may have no reviews or only a few reviews. Therefore, the review information may have a large influence on the classification of the store. Accordingly, the store information corresponding to the store sample may include at least review information.
  • the comment information may include comment content, comment time, number of comments, and so on.
  • the store information can be crawled from a predetermined website (for example, XX reviews, etc.) by a web crawler (for example, one written in Python). For example, user registration information or content publication information on the predetermined website can be crawled. The store information can then be obtained through the type of registered user (such as a store or a consumer) in the user registration information, the type of published content (such as a sale or a purchase) in the content publication information, and the like. If the published content is sale information, the user who posted it may be the store side, from which the store name, store location, and review information can be obtained. In practice, the electronic map can also be searched based on information such as the store name and store location to determine the classification label of the store. For example, stores that cannot be found on the electronic map may be labeled as non-real stores.
  • store samples can also be collected manually offline, for example, by manually checking the store addresses on the website or map one by one to determine their classification labels. At the same time, the store information of the corresponding store can be obtained through at least one of a phone call, a search engine, registration information held by an administrative department, and so on.
  • the review information in the store information can be obtained by, for example, a phone call, a "question and answer" in a search engine, and the like.
  • store samples of known classification tags may also be obtained through acquisition channels that include more aspects, which are not described in detail here.
  • Store samples can include positive and negative samples. Among them, a positive sample may correspond to a real store label, and a negative sample may correspond to a non-real store label.
  • a store that has at least one of the following behaviors within a predetermined period can be selected as a positive sample: sales of vouchers, group purchase activities, promotional activities (such as discounts), reservation services, Q&A interaction, advertising, receiving customer check-ins on the client, etc.
  • some sales methods may be used in store operations, such as selling vouchers, organizing group purchases, organizing promotional activities, etc.
  • Some stores (such as hotels, restaurants, etc.) provide reservation services, some stores conduct Q&A interactions with consumers or potential consumers on related websites (such as travel guide websites), and some stores cooperate with websites to place advertisements in order to increase page views or search rankings.
  • some stores can receive customers' in-store check-ins through an application (such as a food review website). When the customer clicks check-in on the store page in the client, the check-in succeeds if the deviation between the check-in location and the store location is within a set distance range (such as 80 meters).
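  • The check-in rule above can be sketched as a distance test. The following is a minimal illustration, not the method of the specification: the function names, the use of the haversine great-circle formula, and the coordinate format (latitude and longitude in degrees) are all assumptions; only the 80-meter example threshold comes from the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def check_in_succeeds(store_pos, check_in_pos, max_distance_m=80.0):
    """Hypothetical helper: the check-in succeeds if it is close enough to the store."""
    return haversine_m(*store_pos, *check_in_pos) <= max_distance_m
```

  • For instance, under these assumptions a check-in about 55 meters north of the store would be accepted, while one about 220 meters away would be rejected.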
  • the store that provides the check-in may be a real store, and when the customer visits the store for consumption, the check-in is performed. Therefore, a store that has one of the above behaviors within the current or predetermined period can be determined as a positive sample, and these store samples that are positive samples can be assigned real store label.
  • a store that meets the following conditions may be selected as a negative sample: it is marked as permanently closed on the electronic map.
  • when a store closes permanently, it will be deleted from the map or marked as permanently closed. Therefore, the store name and store location can be used to search the electronic map.
  • for stores that the electronic map application marks as permanently closed, the electronic map is used to confirm that the store location is correct; these stores are then used as negative samples and assigned non-real store labels.
  • the store information corresponding to the store sample can also be obtained.
  • the store information may include, for example, a store name, a store address, and the like.
  • the store information may further include, but is not limited to, at least one of the following: basic store information, such as the phone number, business hours, and whether a wireless network connection (such as wifi) is provided; the store brand name; store labels given by a website or an administrative supervision department, such as overseas food selection or local tourism bureau recommendations; and the store category, such as food, shopping, or hotels.
  • non-real stores are shops that have been permanently closed, and their number is often smaller than that of real stores.
  • the store samples with real store labels can be down-sampled so that the number of store samples with real store labels and the number of store samples with non-real store labels are approximately equal, for example, 45,000 each.
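  • The class-balancing step can be illustrated as follows. This is a minimal sketch under assumed conventions: samples are `(store_id, label)` pairs with label 1 for a real store and 0 for a non-real store, and the helper name `balance_by_downsampling` is hypothetical. Random down-sampling of the larger class is one common choice; the specification does not state how the down-sampling is performed.

```python
import random


def balance_by_downsampling(samples, seed=0):
    """Randomly down-sample the larger class so both classes have equal size.

    samples: list of (store_id, label), label 1 = real store, 0 = non-real store.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    target = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    if len(positives) > target:
        positives = rng.sample(positives, target)
    if len(negatives) > target:
        negatives = rng.sample(negatives, target)
    balanced = positives + negatives
    rng.shuffle(balanced)  # avoid all positives preceding all negatives
    return balanced
```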
  • the features of the store sample are extracted based on the store information.
  • the above features include at least a first feature and a second feature.
  • the first feature is obtained based at least on the time-related attributes of the review information, and the second feature is determined based on the semantic description, contained in the review information, related to the authenticity of the store. It is worth noting that the "first" and "second" in "first feature" and "second feature" are only used to distinguish between two different features, and do not indicate any ordering.
  • the time-related attributes of the comment information may include, but are not limited to, at least one of the following: the time when the comment was posted (such as May 1, 2018), the length of time from the comment to the current time (such as 10 hours or 20 days), and the number of comments (such as 100) within a predetermined time period (such as 2 days). It can be understood that a real store keeps attracting new consumers who consume and comment. Therefore, its latest review time is usually recent, the length of time from the latest review to the current time is short, and the number of reviews is more likely to increase within a predetermined time period. For a non-real store, because there are no new consumers, the latest review time is earlier, the length of time from the latest review to the current time is longer, and the number of reviews is less likely to increase within a predetermined time period.
  • the first feature may include, but is not limited to, one or more of the following: the time of the latest comment, the length of the latest comment from the current time, and the increment of the number of comments within a predetermined time period.
  • the latest review time may be the time of the review closest to the current time, for example, 20:00 on March 2, 2015.
  • the length of the latest comment from the current time can be the time difference between the current time and the latest comment time, such as 30 days.
  • the increment of the number of comments in a predetermined time period is the amount of change in the total number of comments over each predetermined time period. For example, suppose the predetermined time period is 3 months.
  • based on the comment times, the total number of comments is counted for every 3 months back from the current time, and the increment of the number of comments is calculated. If 1000 comments were posted in the last 3 months, the review increment for the most recent 3 months is 1000. In this way, it is possible to make full use of the time-related attribute data of the review information of the store samples on the Internet.
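  • The three time-related quantities named above can be computed from the list of comment timestamps. The sketch below is only an illustration: the function name `time_features`, the dictionary keys, and the approximation of "3 months" as 90 days are assumptions not found in the specification.

```python
from datetime import datetime, timedelta


def time_features(comment_times, now, period_days=90):
    """Compute first-feature candidates from comment timestamps.

    comment_times: list of datetime objects, one per comment.
    period_days: length of the predetermined period (90 days assumed for 3 months).
    """
    if not comment_times:
        return {"latest_comment_time": None, "days_since_latest": None, "recent_increment": 0}
    latest = max(comment_times)
    window_start = now - timedelta(days=period_days)
    # Comments posted inside the most recent period are the increment for that period.
    recent_increment = sum(1 for t in comment_times if window_start < t <= now)
    return {
        "latest_comment_time": latest,
        "days_since_latest": (now - latest).days,
        "recent_increment": recent_increment,
    }
```

  • A store whose latest comment is years old and whose recent increment is zero would, under the reasoning above, look more like a non-real store.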
  • the semantic description related to the authenticity of the store contained in the review information may be a semantic description indicating that the store has closed or that it is in good business condition. For example, "the store is closed and no longer exists" may be a semantic description that the store has been permanently closed.
  • the same comment may also carry different meanings depending on other information such as the time it was posted. For example, for a restaurant, the comment "came all this way and it was closed" posted at 12 midnight may only mean that the restaurant had closed for the day, whereas the same comment posted at 12 noon may mean that the restaurant has gone out of business.
  • a very small number of comments (such as 1) that contain the semantics of expressing a shop closure may indicate that the shop has been permanently closed. Therefore, the feature may include a second feature that can reflect whether the review information has a semantic description of the store being permanently closed.
  • the second feature may be expressed in words, for example, having a semantic description of the store permanently closed or including a semantic description related to the authenticity of the store, not having a semantic description of the store permanently closed or not including a semantic description related to the authenticity of the store, and so on.
  • the second feature may also be represented by a numerical value, for example, the second feature is 1 in the case of having a semantic description of the store permanently closed, the second feature is 0 in the case of having no semantic description of the store permanently closed, and so on.
  • the second feature can be extracted through the following steps: step 31, obtain first review information corresponding to the first store sample; step 32, use a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information, wherein the semantic tags include closed or non-closed semantics; step 33, determine the second feature of the first store sample according to the semantic tags.
  • the "first" in "first store sample" and "first review information" here means "some", "one of them", or "any one", and indicates the correspondence between store samples and review information; it does not indicate an order or a distinction between store samples.
  • the review information of the store sample may be obtained first.
  • the review information of a shop sample may correspond to one or more pieces of review data.
  • Each review data may include a review content, a review time, and data such as a user ID who posted the review.
  • a pre-trained semantic model is used to determine the semantic label corresponding to each piece of review data in the review information.
  • each piece of comment data can correspond to a semantic tag.
  • Each piece of comment data can be input into a pre-trained semantic model, and the semantic label of a piece of comment data can be determined according to the output of the semantic model.
  • the semantic model can be trained through a pre-annotated comment set.
  • some reviews can be selected from the review data of multiple store samples and added to the review set, especially review data containing review phrases such as "closed" or "shut down". The semantic labels of these review data are determined through manual identification and labeling, and are then used as known semantic labels to train a supervised model, such as a logistic regression (LR) model.
  • Model training is a process of determining model parameters with known inputs (such as comment sentences) and outputs (such as known semantic labels), and will not be repeated here.
  • the semantic label of the review data may be either "with closing semantics" or "without closing semantics".
  • the output of the semantic model can be one of the semantic labels directly, or it can be a numerical value, such as 1, 0, and so on.
  • the output of the semantic model is one of two possible values (such as 1, 0, etc.), where each value corresponds to a semantic label, such as 1 corresponding to a closed business semantic label.
  • the output of the semantic model can also be one of multiple possible values (such as any decimal between 0-1, etc.).
  • a threshold can be set to determine which semantic label the output value leans toward; for example, an output greater than 0.6 leans toward the label with closing semantics.
  • each word in the review data may first be expressed as a respective word vector through an unsupervised word vector model (such as the word2vec model); based on the word vectors, a comment vector corresponding to the comment data is determined; the determined comment vector is input into the semantic model to obtain an output result of the semantic model; and a semantic label is added to the comment data according to the output result.
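  • Mapping a continuous model output to a semantic label can be sketched as a simple comparison. The label strings and the function name below are hypothetical; the 0.6 threshold and the "greater than" comparison follow the example in the text.

```python
CLOSED = "with closing semantics"       # hypothetical label string
NOT_CLOSED = "without closing semantics"  # hypothetical label string


def label_from_output(score, threshold=0.6):
    """Return the semantic label that a model output in [0, 1] leans toward."""
    return CLOSED if score > threshold else NOT_CLOSED
```

  • With a binary output (1 or 0), the same function still applies, since 1 exceeds the threshold and 0 does not.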
  • the review vector corresponding to the review data is determined based on each word vector. For example, the review vector may be an average of different dimensions of each word vector, or a weighted average of different dimensions.
  • suppose the words of the comment data are represented as word vectors $v_1, v_2, \dots, v_n$.
  • the comment vector corresponding to the comment data, determined based on the word vectors, may be the dimension-wise average $c = \frac{v_1 + v_2 + \dots + v_n}{n}$.
  • the number of occurrences of each word can also be used as a weight, and a weighted average over each dimension of the word vectors gives the comment vector $c = \frac{1 \cdot v_1 + 1 \cdot v_2 + \dots + 1 \cdot v_n}{1 + 1 + \dots + 1}$.
  • the 1 in front of each word vector is the number of occurrences of the corresponding word, and the denominator is the sum of the numbers of occurrences of all words. In this example, the number of occurrences of each word is 1; in practice it can take other values.
  • the comment vector can be input to the semantic model to obtain the output of the semantic model. Understandably, each dimension of the comment vector can also be input into the semantic model as a separate feature. Semantic tags can then be added to the comment data according to the output of the semantic model. For example, if the output of the semantic model is 1, a semantic tag of "with closing semantics" is added to the comment data.
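  • The averaging of word vectors into a comment vector can be sketched in plain Python. This is an illustration under assumptions: `word_vectors` stands in for the output of a word vector model such as word2vec and is a hypothetical dictionary from word to a list of floats; words without a known vector are simply skipped, which the specification does not address.

```python
from collections import Counter


def comment_vector(words, word_vectors, weighted=False):
    """Average the word vectors of a comment, dimension by dimension.

    weighted=False: plain average over the distinct words of the comment.
    weighted=True: each word vector is weighted by the word's occurrence count,
    and the denominator is the sum of the occurrence counts.
    """
    counts = Counter(w for w in words if w in word_vectors)
    if not counts:
        raise ValueError("no word of the comment has a known vector")
    dim = len(word_vectors[next(iter(counts))])
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in counts.items():
        weight = count if weighted else 1
        weight_sum += weight
        for i, value in enumerate(word_vectors[word]):
            total[i] += weight * value
    return [x / weight_sum for x in total]
```

  • When every word occurs exactly once, the weighted and unweighted results coincide, which matches the example in the text where every weight is 1.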
  • Step 33: determine the second feature of each store sample according to the semantic tags corresponding to that store sample.
  • The second feature may be expressed in words, for example "has a semantic description of permanent store closure" (or "contains a semantic description related to the authenticity of the store") versus "has no semantic description of permanent store closure", or as a value such as 1 or 0.
  • When the semantic tags corresponding to a store sample include a tag with closing semantics, the second feature of the store sample is determined as containing the semantics that the store is not a real store.
  • A number threshold may also be set, so that the second feature of the store sample is determined as containing the semantics that the store is not a real store only when the number of pieces of review data tagged with closing semantics exceeds the threshold (such as 10).
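As a minimal sketch of step 33, the function below turns the per-review semantic tags into a numeric second feature. The 1/0 encoding of the tags and of the feature, and the name `second_feature`, are assumptions chosen here for illustration.

```python
def second_feature(semantic_tags, count_threshold=0):
    """Return 1 if the tags suggest the store is not a real store, else 0.

    semantic_tags: one tag per piece of review data, where 1 means
    "with closing semantics" and 0 means "without closing semantics"
    (an encoding assumed for this sketch).
    count_threshold: the feature is set only when the number of closing
    tags exceeds this value; 0 means a single closing tag is enough.
    """
    closing_count = sum(1 for tag in semantic_tags if tag == 1)
    return 1 if closing_count > count_threshold else 0
```

Setting `count_threshold=10` reproduces the variant in which more than ten reviews with closing semantics are required before the feature is set.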
  • The features of the store sample may also include review-count features, such as the total number of reviews, the number of positive reviews, the number of negative reviews, the proportion of negative reviews, and the number of pictures in the reviews. It can be understood that a store with a large proportion of negative reviews is more likely to be a non-real store, while a store with a large total number of reviews or many pictures in its reviews is more likely to be a real store. Therefore, the review-count features can be used as factors that influence whether a store is classified as a real store.
  • the characteristics of the store sample may further include basic information completeness characteristics.
  • Basic information includes, for example, the telephone number, business hours, whether a wireless network connection (such as WiFi) is available, and service facilities. The more complete the basic information is, the more likely it is that the store really exists.
  • the basic information completeness may be proportional to the number of basic information items. Therefore, the basic information completeness feature can be used as a factor that influences whether the store is classified as a real store.
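One simple way to make completeness proportional to the number of basic information items is to use the fraction of items that are filled in. The sketch below assumes a hypothetical field list and a dictionary representation of the store information; neither is specified in the original text.

```python
# Hypothetical field names; the text mentions telephone, business hours,
# wireless network availability and service facilities as examples.
BASIC_FIELDS = ("telephone", "business_hours", "wifi", "service_facilities")

def completeness(store_info, fields=BASIC_FIELDS):
    """Fraction of basic information items that are filled in (0.0 to 1.0)."""
    filled = sum(1 for f in fields if store_info.get(f) not in (None, ""))
    return filled / len(fields)
```

A value such as `wifi=False` still counts as filled in, since it states that no wireless network is available rather than leaving the item blank.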
  • the characteristics of the store sample may further include predetermined identification characteristics.
  • The predetermined identifier may be, for example, a brand-store or chain-store mark, or a preferred label given by a website or an administrative agency (such as a recommendation label from a local tourism bureau). Understandably, brand stores and chain stores are usually stores with high visibility and market recognition, and are therefore more likely to be real stores. Websites and administrative agencies give preferred labels to stores that have passed audits and inspections, and these stores are also more likely to be real stores. Therefore, the predetermined identification feature can be used as a factor that influences whether the store is classified as a real store.
  • the characteristics of the store sample may further include store operation category characteristics.
  • The store operation category may be, for example, food, hotel, or clothing. On some websites, food stores receive many reviews, so classifying by the number of reviews alone gives low accuracy. Stores in different operation categories can therefore be treated differently, with greater weight given to stores in operation categories that receive fewer reviews.
  • the characteristics of the store sample may also include consumer scoring characteristics.
  • Consumer ratings can be either points or star ratings. It is worth noting that if the store samples are obtained from the same website and the consumer scores follow the same standard, the consumer scores can be used directly as the consumer scoring feature. If the store samples are obtained from different websites whose scoring standards may differ, the ratio of the consumer score to the full score can be used as the consumer scoring feature. Consumer ratings affect the customer flow of a store, and a store with low customer flow is more likely to become a non-real store. Therefore, the consumer scoring feature can be used as a factor that influences whether the store is classified as currently being a real store.
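The ratio of consumer score to full score can be computed as below, so that a 5-star site and a 10-point site yield comparable features. The function name and argument names are assumptions made for this sketch.

```python
def consumer_score_feature(score, full_marks):
    """Scale a consumer score to [0, 1] so different sites are comparable."""
    if full_marks <= 0:
        raise ValueError("full_marks must be positive")
    return score / full_marks
```

For example, 4 stars out of 5 and 8 points out of 10 both map to the same feature value.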
  • the features of the store sample may also include more features, which will not be exemplified here.
  • the classification model is trained based on the characteristics and classification labels of each store sample.
  • the process of model training is the process of determining model parameters based on known input features and classification results.
  • the input feature is the feature of the store sample, where the feature includes multiple input features
  • the classification result is determined according to the classification label of the store sample.
  • For example, the classification result takes the value 0 or 1, where 0 corresponds to the real store label and 1 to the non-real store label, and so on.
  • a store sample corresponds to a set of known input features and classification results.
  • The known input features fed to the input layer 42 are the features of each store sample, and the output result of the output layer 43 can be compared with the classification label of the corresponding store sample. According to the comparison result, the parameters of the intermediate layer 44 are adjusted, together with the weight parameters represented by the arrows between the input layer 42 and the intermediate layer 44 and between the intermediate layer 44 and the output layer 43.
  • The known input features fed to the input layer 42 include a first feature 421 and a second feature 422, both of which are obtained from the data related to the review information 411 in the store information 41.
  • store samples can be divided into training samples and test samples.
  • The features of each training sample are used as input in turn, and the parameters of the classification model are adjusted according to the comparison between the output of the classification model and the classification label, so that the output becomes more consistent with the classification label of the currently input training sample. In this way the classification model is trained.
  • The features of each test sample are input into the classification model trained with the training samples, and the classification label of each test sample is used to check the correctness of the corresponding output of the classification model, giving the detection result of the classification model. For example, if the output of the classification model is consistent with the classification label, the output is determined to be correct. In this way, the detection result of the classification model over all test samples, such as its accuracy, can be obtained.
  • The classification model may be further adjusted according to the detection result, for example by adjusting the network structure of the classification model or changing the classification model. For example, when the classification model is a gradient boosted decision tree (GBDT) model, the number of trees, the depth of each tree, and the learning rate can be adjusted. After the classification model is adjusted, it is trained again with the training samples, and its detection result is obtained again with the test samples, until the detection result meets the preset condition.
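The train, test, and adjust cycle can be sketched as follows. To stay self-contained, this sketch swaps in a small logistic-regression model trained by stochastic gradient descent instead of the GBDT model mentioned in the text, adjusts only the learning rate, and uses plain accuracy as the detection result; the 80/20 split, the candidate learning rates, and all function names are assumptions made here.

```python
import math
import random

def predict(weights, features):
    """Logistic model output in (0, 1); weights[0] is the bias."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, learning_rate, epochs=200):
    """Adjust the parameters by comparing each output with its label."""
    weights = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, features) - label
            weights[0] -= learning_rate * error
            for i, x in enumerate(features):
                weights[i + 1] -= learning_rate * error * x
    return weights

def accuracy(weights, samples):
    """Share of samples whose thresholded output matches the label."""
    hits = sum((predict(weights, f) > 0.5) == bool(y) for f, y in samples)
    return hits / len(samples)

def train_until_ok(samples, target=0.7, rates=(0.01, 0.1, 1.0), seed=0):
    """Split the samples, then retrain with adjusted settings until the
    test accuracy meets the preset condition or the candidates run out."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train_set, test_set = shuffled[:cut], shuffled[cut:]
    best = None
    for rate in rates:
        weights = train(train_set, rate)
        score = accuracy(weights, test_set)
        if best is None or score > best[1]:
            best = (weights, score)
        if score >= target:
            break  # preset condition met, stop adjusting
    return best
```

Each sample is a `(features, label)` pair with label 0 for a real store and 1 for a non-real store, matching the convention used in the text.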
  • the preset condition here may be a condition set on a detection result of the classification model.
  • The detection result may include values such as the area under the curve (AUC), precision, recall, and F1 score.
  • For example, the preset condition is that precision and recall are both greater than 0.7.
  • AUC: 0.868
  • Precision: 0.767
  • Recall: 0.803
  • F1: 0.784
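The detection result and the preset condition can be computed as in the sketch below, which assumes that the reported 0.767 is a precision value for the positive class (label 1). Under that reading, the harmonic mean of 0.767 and 0.803 is about 0.785, which agrees with the reported F1 of 0.784 up to rounding. The function names are assumptions made here.

```python
def detection_result(labels, outputs):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, outputs))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def meets_preset_condition(precision, recall, threshold=0.7):
    """Preset condition from the text: both values greater than 0.7."""
    return precision > threshold and recall > threshold
```

With the values listed above, `meets_preset_condition(0.767, 0.803)` holds, so no further adjustment of the model would be needed.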
  • The store information corresponding to the selected store samples includes review information. Therefore, the features extracted from the store information may include at least: a first feature obtained based on the time-related attributes of the review information, and a second feature determined based on the semantic description, contained in the review information, related to the authenticity of the store. Training a classification model on features that include the first feature and the second feature makes full use of Internet data and yields a classification model with higher accuracy, thereby improving the effectiveness of store classification.
  • a method for classifying a store is also provided. It is used to determine whether the store is a real store through a classification model. This method is suitable for an electronic device with a certain data processing capability, such as the server in FIG. 1.
  • The method for classifying a store includes the following steps: step 51, obtaining store information of a store to be classified, where the store information includes review information; step 52, extracting features of the store to be classified based on the store information,
  • where the features include at least a first feature and a second feature,
  • the first feature being obtained based at least on the time-related attributes of the review information, and the second feature being determined based on the semantic description, contained in the review information, related to the authenticity of the store;
  • step 53, inputting the features of the store to be classified into the classification model to obtain an output result of the classification model;
  • step 54, determining, according to the output result, whether the store to be classified is currently a real store.
  • In step 51, store information of the store to be classified is obtained.
  • the store information includes at least review information, such as review content, review time, and number of reviews.
  • the store information may also include but is not limited to at least one of the following: basic store information, store brand name, store label given by the website or administrative supervision department, store classification, etc.
  • Store information can be crawled from a predetermined website (such as ⁇ comments, etc.) by a web crawler (for example, one written in Python).
  • the features of the store to be classified are extracted based on the store information.
  • the features here correspond to the input features of the classification model.
  • the feature includes at least a first feature and a second feature.
  • the first feature is obtained based on at least the time-related attributes of the review information
  • the second feature is determined based on the semantic description related to the authenticity of the store included in the review information. It is worth noting that the "first" and “second” in the "first feature” and “second feature” are only used to distinguish between two different features, and do not indicate a sequence limitation.
  • The time-related attributes of the review information may include, but are not limited to, at least one of the following: the posting time of a review, the time elapsed between a review and the current time, the number of reviews within a predetermined time period, and the like.
  • The first feature may include, but is not limited to, one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews within a predetermined time period. In this way, the time-related attribute data of the store's review information on the Internet can be fully used.
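A minimal sketch of extracting the first feature from review timestamps is shown below. The 30-day window standing in for the "predetermined time period", the day-level granularity, and the function name `first_feature` are assumptions made here.

```python
from datetime import datetime, timedelta

def first_feature(review_times, now, window_days=30):
    """Time-related features from a store's review timestamps.

    Returns (latest review time, days since the latest review, number of
    reviews posted inside the most recent window). window_days is an
    assumed value for the "predetermined time period".
    """
    if not review_times:
        return None, None, 0
    latest = max(review_times)
    days_since_latest = (now - latest).days
    window_start = now - timedelta(days=window_days)
    recent_count = sum(1 for t in review_times if t >= window_start)
    return latest, days_since_latest, recent_count

# Illustrative timestamps only.
times = [datetime(2018, 6, 20), datetime(2018, 5, 1), datetime(2017, 1, 1)]
features = first_feature(times, now=datetime(2018, 6, 25))
```

A store whose latest review is years old and whose recent-review count is zero gives the classifier a strong hint that it may no longer be operating.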
  • The semantic description related to the authenticity of the store contained in the review information may be, for example, a description indicating that the store has closed or that it is operating well.
  • a very small number of comments such as one
  • the second feature can be expressed in words or numerically.
  • The second feature can be extracted by: obtaining the review information of the store to be classified; using a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the review information, where the semantic tags include "with closing semantics" and "without closing semantics"; and determining the second feature of the store to be classified according to the semantic tags corresponding to the store to be classified.
  • Each piece of review data may include data such as the review content, the review time, and the ID of the user who posted the review.
  • Each piece of review data can be input into a pre-trained semantic model, and the semantic label of each piece of review data is determined based on the output of the semantic model. Then, the second feature of the store to be classified is determined according to these semantic tags.
  • For each piece of review data, each word may first be expressed as a word vector through an unsupervised word vector model (such as the word2vec model); a review vector corresponding to the review data is determined based on the word vectors; the review vector is input into the semantic model to obtain an output result of the semantic model; and a semantic tag is added to the review data according to the output result.
  • When the semantic tags corresponding to the store to be classified include a tag with closing semantics, the second feature of the store to be classified is determined as containing the semantics that the store is not a real store.
  • A number threshold may also be set, so that the second feature of the store to be classified is determined as containing the semantics that the store is not a real store only when the number of pieces of review data tagged with closing semantics exceeds the threshold.
  • The features of the store to be classified may further include, but are not limited to, at least one of the following: the review-count feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • Step 53: input the features of the store to be classified into the classification model to obtain an output result of the classification model.
  • the output of the classification model can be a numerical value or a classification label.
  • the classification label may include a real store label and a non-real store label.
  • The features of the store to be classified, extracted from the store information 41, are input to the input layer 42; these features include the first feature 421 and the second feature 422 extracted from the review information 411. After passing through the intermediate layer 44, an output result is obtained from the output layer 43.
  • Step 54: determine, according to the output result, whether the store to be classified is currently a real store.
  • When the output result is a classification label,
  • whether the store to be classified is a real store is determined directly from the classification label: a store to be classified with the real store label is a real store; otherwise it is a non-real store.
  • When the output result is a numerical value,
  • the classification label indicating whether the store to be classified is a real store is determined from that value.
  • For example, the classification label of the store to be classified can be determined according to which end of the value range the value is biased toward.
  • Alternatively, it can be determined according to a set threshold. For example, if the threshold for being biased toward 1 is set to 0.6, all values greater than 0.6 are treated as biased toward 1 and correspond to the non-real store classification label.
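The threshold rule for a numeric output can be sketched as follows, using the convention from the text that 0 stands for a real store and 1 for a non-real store. The label strings and the function name are assumptions made here.

```python
REAL_STORE = "real store"
NON_REAL_STORE = "non-real store"

def label_from_output(value, threshold=0.6):
    """Map a numeric model output to a classification label.

    Values strictly greater than the threshold are treated as biased
    toward 1, which corresponds to the non-real store label.
    """
    return NON_REAL_STORE if value > threshold else REAL_STORE
```

Raising the threshold makes the classifier more conservative about declaring a store non-real, at the cost of missing some stores that have actually closed.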
  • The method for classifying a store is performed using a classification model trained as in the embodiment of FIG. 2. Therefore, the description of the store samples in the embodiment shown in FIG. 2 is also applicable to the corresponding content of the store to be classified in the embodiment shown in FIG. 5, and details are not repeated here.
  • FIG. 6 shows a schematic block diagram of a training apparatus for a classification model according to an embodiment.
  • The apparatus 600 for training a classification model includes: a selection unit 61 configured to select a predetermined number of store samples,
  • where each store sample corresponds to store information and a classification label,
  • the classification labels include a real store label and a non-real store label,
  • and the store information includes review information;
  • an extraction unit 62 configured to extract features of the store samples based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on the time-related attributes of the review information, and the second feature being determined based on the semantic description, contained in the review information, related to the authenticity of the store; and
  • a training unit 63 configured to train a classification model based on the features and classification labels of the store samples.
  • the store sample may include a positive sample and a negative sample, where the positive sample corresponds to a real store label and the negative sample corresponds to a non-real store label.
  • The selection unit 61 may be configured to select, as positive samples, stores that show at least one of the following behaviors within a predetermined period: selling vouchers, running group purchase activities, running promotional activities, offering reservation services, taking part in Q&A interactions, placing advertisements, and receiving customer check-ins on the client.
  • the selecting unit 61 may be further configured to: select a store that meets the following conditions as a negative sample: it is marked as permanently closed on the electronic map.
  • The first feature may include one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews within a predetermined time period.
  • The extraction unit 62 may further include: a review information acquisition module configured to acquire first review information of a first store sample; a semantic tag determination module configured to use a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information, where the semantic tags include "with closing semantics" and "without closing semantics"; and a second feature determination module configured to determine the second feature of the first store sample according to the semantic tags. It is worth noting that the "first" and "second" in "first feature" and "second feature" are only used to distinguish two different features and do not indicate an order.
  • The second feature determination module may be further configured to: when the semantic tags corresponding to the first store sample include a tag with closing semantics, determine the second feature of the first store sample as containing the semantics that the store is not a real store.
  • The "first" in "first store sample" and "first review information" here means "a certain", "one", or "any", and indicates the correspondence between the store sample and the review information; it does not indicate an order or a distinction between store samples.
  • The semantic tag determination module may be further configured to: for first review data in the first review information, express each word in the first review data as a word vector through an unsupervised word vector model; determine, based on the word vectors, a first review vector corresponding to the first review data; input the first review vector into the semantic model to obtain an output result of the semantic model; and add a semantic tag to the first review data according to the output result.
  • the above-mentioned features may further include, but are not limited to, at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • The store samples include training samples and test samples.
  • The training unit 63 may include: a training module configured to take the features of each training sample as input and, according to the comparison between the output result of the classification model and the classification label, adjust the parameters of the classification model so as to train it; a test module configured to input the features of each test sample into the classification model trained with the training samples, and to check the correctness of each output result of the classification model against the classification label of the test sample, so as to obtain the detection result of the classification model; and an adjustment module configured to adjust the classification model according to the detection result if the detection result does not satisfy a preset condition, for example by adjusting the network structure of the classification model or changing the classification model.
  • the preset condition here may be an evaluation parameter condition for the classification model.
  • The model evaluation parameters may include the area under the curve (AUC), precision, recall, F1 score, and so on.
  • the apparatus 600 shown in FIG. 6 corresponds to the method shown in FIG. 2. Therefore, the related description in FIG. 2 is also applicable to the apparatus 600, and details are not described herein again.
  • a device for classifying a store is also provided.
  • Fig. 7 shows a schematic block diagram for a store classification device according to one embodiment.
  • The apparatus 700 for classifying a store includes: an obtaining unit 71 configured to obtain store information of a store to be classified, where the store information includes review information; an extracting unit 72 configured to extract features of the store to be classified based on the store information, where the features include at least a first feature and a second feature, the first feature being obtained based at least on the time-related attributes of the review information, and the second feature being determined based on the semantic description, contained in the review information, related to the authenticity of the store;
  • a classification unit 73 configured to input the features of the store to be classified into the classification model to obtain an output result of the classification model; and a determination unit 74 configured to determine, based on the output result, whether the store to be classified is currently a real store.
  • The first feature may include one or more of the following: the time of the latest review, the time elapsed between the latest review and the current time, and the increment in the number of reviews within a predetermined time period.
  • The second feature may be extracted by: obtaining first review information of a first store sample; using a pre-trained semantic model to determine the semantic tag corresponding to each piece of review data in the first review information, where the semantic tags include "with closing semantics" and "without closing semantics"; and determining the second feature of the first store sample according to the semantic tags.
  • When the semantic tags include a tag with closing semantics, the second feature of the first store sample is determined as containing the semantics that the store is not a real store.
  • Using a pre-trained semantic model to determine the semantic tag of each piece of review data in the review information includes: for first review data in the first review information, expressing each word in the first review data as a word vector through an unsupervised word vector model; determining, based on the word vectors, a first review vector corresponding to the first review data; inputting the first review vector into the semantic model to obtain an output result of the semantic model; and adding a semantic tag to the first review data according to the output result.
  • the above features may further include at least one of the following features: the number of reviews feature, the basic information completeness feature, the predetermined identification feature, the store operation category feature, and the consumer scoring feature.
  • the Internet data can be fully utilized to extract effective classification features, thereby improving the effectiveness of store classification.
  • The apparatus 700 shown in FIG. 7 corresponds to the method shown in FIG. 5. Therefore, the related description of FIG. 5 is also applicable to the apparatus 700, and details are not repeated here.
  • A computer-readable storage medium is also provided, having stored thereon a computer program; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2 or FIG. 5.
  • A computing device is also provided, which includes a memory and a processor.
  • The memory stores executable code,
  • and when the processor executes the executable code, the method described in conjunction with FIG. 2 or FIG. 5 is implemented.
  • the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in or transmitted over as one or more instructions or code on a computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a classification model training method, and a store classification method and device. During training of a classification model, the store information corresponding to a selected store sample comprises review information. The store information is used to extract features of the store sample, the features comprising: a first feature obtained at least on the basis of a time attribute of the review information; and a second feature determined on the basis of a semantic description that is included in the review information and relates to whether or not the store really exists. When the trained classification model is used to classify stores, the features extracted from the stores to be classified also comprise the first feature and the second feature. In this way, Internet data can be fully utilized to improve the effectiveness of store classification.
PCT/CN2019/080022 2018-06-25 2019-03-28 Classification model training method, and store classification method and device WO2020001106A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810662702.4A CN108985347A (zh) 2018-06-25 2018-06-25 分类模型的训练方法、店铺分类的方法及装置
CN201810662702.4 2018-06-25

Publications (1)

Publication Number Publication Date
WO2020001106A1 true WO2020001106A1 (fr) 2020-01-02

Family

ID=64538738

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080022 WO2020001106A1 (fr) 2018-06-25 2019-03-28 Classification model training method, and store classification method and device

Country Status (3)

Country Link
CN (1) CN108985347A (fr)
TW (1) TW202001736A (fr)
WO (1) WO2020001106A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625721A (zh) * 2020-05-26 2020-09-04 汉海信息技术(上海)有限公司 内容推荐方法及装置
CN112328899A (zh) * 2020-11-27 2021-02-05 京东数字科技控股股份有限公司 信息处理方法、信息处理装置、存储介质与电子设备
CN112561530A (zh) * 2020-12-25 2021-03-26 民生科技有限责任公司 一种基于多模型融合的交易流水处理方法及系统
CN115131068A (zh) * 2022-07-08 2022-09-30 连连(杭州)信息技术有限公司 一种店铺分类方法、装置和计算机存储介质
CN118036602A (zh) * 2023-08-14 2024-05-14 广东数鼎科技有限公司 一种虚假评论识别方法及装置

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985347A (zh) * 2018-06-25 2018-12-11 阿里巴巴集团控股有限公司 分类模型的训练方法、店铺分类的方法及装置
CN109685555A (zh) * 2018-12-13 2019-04-26 拉扎斯网络科技(上海)有限公司 商户筛选方法、装置、电子设备及存储介质
CN109697637B (zh) * 2018-12-27 2022-08-26 拉扎斯网络科技(上海)有限公司 对象类别确定方法、装置、电子设备及计算机存储介质
CN109840831A (zh) * 2019-01-29 2019-06-04 浙江口碑网络技术有限公司 页面呈现方法及装置
CN109993545B (zh) * 2019-02-01 2024-07-19 创新先进技术有限公司 实体店的验真方法和装置
CN110334306A (zh) * 2019-06-21 2019-10-15 无线生活(北京)信息技术有限公司 标签处理方法及装置
CN111008331B (zh) * 2019-11-29 2023-09-15 拉扎斯网络科技(上海)有限公司 门店端的展示方法、装置、电子设备及存储介质
CN111368761B (zh) * 2020-03-09 2022-12-16 腾讯科技(深圳)有限公司 店铺营业状态识别方法、装置、可读存储介质和设备
CN114339859B (zh) * 2020-09-27 2023-08-15 中国移动通信集团广东有限公司 识别全屋无线网络WiFi潜在用户的方法、装置及电子设备
CN114519114B (zh) * 2020-11-20 2024-08-13 北京达佳互联信息技术有限公司 多媒体资源分类模型构建方法、装置、服务器及存储介质
CN113449169B (zh) * 2021-09-01 2021-12-14 广州越创智数信息科技有限公司 一种基于rpa的舆情数据获取方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108111A1 (en) * 2012-10-12 2014-04-17 Redpixtec. Gmbh Mobile advertising system
CN105095387A (zh) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 基于用户评论信息的poi数据采集方法及装置
CN105808679A (zh) * 2016-03-02 2016-07-27 陈健强 一种基于电子地图的店家营业状态标记实现方法及系统
CN107092641A (zh) * 2017-02-27 2017-08-25 口碑控股有限公司 店铺营业状态的判断方法和装置、店铺搜索的方法和装置
CN108985347A (zh) * 2018-06-25 2018-12-11 阿里巴巴集团控股有限公司 分类模型的训练方法、店铺分类的方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866542B (zh) * 2015-05-05 2018-07-06 腾讯科技(深圳)有限公司 一种poi数据验证方法和装置
CN108197177B (zh) * 2017-12-21 2019-12-17 北京三快在线科技有限公司 业务对象的监测方法、装置、存储介质和计算机设备

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625721A (zh) * 2020-05-26 2020-09-04 汉海信息技术(上海)有限公司 内容推荐方法及装置
CN111625721B (zh) * 2020-05-26 2023-12-22 汉海信息技术(上海)有限公司 内容推荐方法及装置
CN112328899A (zh) * 2020-11-27 2021-02-05 京东数字科技控股股份有限公司 信息处理方法、信息处理装置、存储介质与电子设备
CN112328899B (zh) * 2020-11-27 2024-04-16 京东科技控股股份有限公司 信息处理方法、信息处理装置、存储介质与电子设备
CN112561530A (zh) * 2020-12-25 2021-03-26 民生科技有限责任公司 一种基于多模型融合的交易流水处理方法及系统
CN115131068A (zh) * 2022-07-08 2022-09-30 连连(杭州)信息技术有限公司 一种店铺分类方法、装置和计算机存储介质
CN115131068B (zh) * 2022-07-08 2023-12-26 连连(杭州)信息技术有限公司 一种店铺分类方法、装置和计算机存储介质
CN118036602A (zh) * 2023-08-14 2024-05-14 广东数鼎科技有限公司 一种虚假评论识别方法及装置

Also Published As

Publication number Publication date
CN108985347A (zh) 2018-12-11
TW202001736A (zh) 2020-01-01

Similar Documents

Publication Publication Date Title
WO2020001106A1 (fr) Classification model training method, and store classification method and device
CN108154401B (zh) 用户画像刻画方法、装置、介质和计算设备
US20240062271A1 (en) Recommendations Based Upon Explicit User Similarity
CN107798571B (zh) 恶意地址/恶意订单的识别系统、方法及装置
US8600796B1 (en) System, method and computer program product for identifying products associated with polarized sentiments
US8818788B1 (en) System, method and computer program product for identifying words within collection of text applicable to specific sentiment
CN110135901A (zh) 一种企业用户画像构建方法、系统、介质和电子设备
CN112269805B (zh) 数据处理方法、装置、设备及介质
CN109118316B (zh) 线上店铺真实性的识别方法和装置
US20160140627A1 (en) Generating high quality leads for marketing campaigns
US20130332385A1 (en) Methods and systems for detecting and extracting product reviews
US20180108029A1 (en) Detecting differing categorical features when comparing segments
CN109816134B (zh) 收货地址预测方法、装置以及存储介质
CN107832338B (zh) 一种识别核心产品词的方法和系统
US20180285748A1 (en) Performance metric prediction for delivery of electronic media content items
US11487835B2 (en) Information processing system, information processing method, and program
Chen et al. Big data analytics on aviation social media: The case of china southern airlines on sina weibo
KR101784559B1 (ko) 사용자의 소비 패턴/관심사 분석 방법 및 장치
CN111091409B (zh) 客户标签的确定方法、装置和服务器
CN116029637A (zh) 跨境电商物流渠道智能推荐方法及装置、设备、存储介质
Zhao et al. Online comments of multi-category commodities based on emotional tendency analysis
JP2020057206A (ja) 情報処理装置
CN113779276A (zh) 用于检测评论的方法和装置
CN112784021A (zh) 用于使用从评论提取的关键字的系统和方法
US20150073902A1 (en) Financial Transaction Analytics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19827203; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19827203; Country of ref document: EP; Kind code of ref document: A1)