CN108959289B - Website category acquisition method and device - Google Patents

Publication number
CN108959289B
Authority
CN
China
Prior art keywords
website
data sets
access data
order data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710351636.4A
Other languages
Chinese (zh)
Other versions
CN108959289A (en)
Inventor
林霞霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710351636.4A
Publication of CN108959289A
Application granted
Publication of CN108959289B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/07 Guided tours

Abstract

The application discloses a website category acquisition method and device. One embodiment of the method comprises: acquiring an order data set and an access data set of a target website within a first preset time period; analyzing the order data set and the access data set, selecting order data from the order data set to generate a target order data set, and selecting access data from the access data set to generate a target access data set; extracting a feature vector from the target order data set and the target access data set; and inputting the feature vector into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing the correspondence between the feature vector of a website and the secondary category of the website. The embodiment improves website classification efficiency.

Description

Website category acquisition method and device
Technical Field
The present application relates to the field of computer technology, specifically to the field of Internet technology, and in particular to a website category acquisition method and device.
Background
With the popularization of the internet, the advantages of online shopping have become more prominent. The number of users who shop online keeps growing, and websites of various types (e.g., online stores) keep emerging in large numbers.
Websites of the same type may operate under different business models. Websites of the same type may therefore be further divided into different categories according to their business models.
However, existing website classification methods generally rely on a person skilled in the art classifying websites through manual analysis, so website classification efficiency is low.
Disclosure of Invention
An object of the embodiments of the present application is to provide an improved method and apparatus for acquiring a website category, so as to solve the technical problems mentioned in the above background.
In a first aspect, an embodiment of the present application provides a method for acquiring a website category, where the method includes: acquiring an order data set and an access data set of a target website within a first preset time period; analyzing the order data set and the access data set, selecting order data from the order data set to generate a target order data set, and selecting access data from the access data set to generate a target access data set; extracting a feature vector from the target order data set and the target access data set; and inputting the feature vector into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing the correspondence between the feature vector of a website and the secondary category of the website.
In some embodiments, the feature vector comprises at least one of: the order volume of the target website, the visitor volume of the target website, and the browsing volume of the target website.
In some embodiments, after inputting the feature vector into the pre-trained website classification model for classification to obtain the secondary category of the target website, the method further includes: querying a first correspondence table to obtain the primary category to which the secondary category of the target website belongs, wherein the first correspondence table is used for storing secondary categories and the primary categories to which they belong; acquiring an initial primary category submitted by the target website during registration; determining whether the primary category to which the secondary category of the target website belongs is the same as the initial primary category; and if not, outputting abnormality prompt information.
In some embodiments, after inputting the feature vector into the pre-trained website classification model for classification to obtain the secondary category of the target website, the method further includes: querying a second correspondence table to obtain the order peak time period corresponding to the secondary category of the target website, wherein the second correspondence table is used for storing secondary categories and the order peak time periods corresponding to them; and outputting the order peak time period corresponding to the secondary category of the target website.
In some embodiments, the method further comprises the step of building a website classification model, the step of building a website classification model comprising: respectively acquiring order data sets and access data sets of a plurality of websites in a second preset time period; analyzing order data sets and access data sets of a plurality of websites, selecting order data from the order data sets of the websites to generate a plurality of sample order data sets, and selecting access data from the access data sets of the websites to generate a plurality of sample access data sets; extracting a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets respectively; and clustering the plurality of sample feature vectors to obtain a website classification model.
In some embodiments, analyzing the order data sets and the access data sets of the plurality of websites, selecting order data from the order data sets of the plurality of websites to generate a plurality of sample order data sets, and selecting access data from the access data sets of the plurality of websites to generate a plurality of sample access data sets comprises: deleting order data and access data with missing fields from the order data sets and the access data sets of the plurality of websites to obtain first order data sets and first access data sets of the plurality of websites; performing deduplication on the first order data sets and the first access data sets of the plurality of websites, respectively, to obtain second order data sets and second access data sets of the plurality of websites; and denoising the second order data sets and the second access data sets of the plurality of websites based on a preset first cluster number to obtain the plurality of sample order data sets and the plurality of sample access data sets.
In some embodiments, extracting a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively, comprises: normalizing the plurality of sample order data sets and the plurality of sample access data sets, respectively, to obtain a plurality of normalized sample order data sets and a plurality of normalized sample access data sets; and generating first-order derivative sets corresponding to the plurality of normalized sample order data sets and first-order derivative sets corresponding to the plurality of normalized sample access data sets, respectively, as the plurality of sample feature vectors.
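The normalization and first-order derivative steps can be illustrated with a minimal Python sketch. The text does not specify the normalization scheme or how the derivative is computed, so the sketch assumes that each sample data set has already been reduced to a time series (for example, daily order counts), uses min-max normalization, and approximates the first-order derivative by differences between consecutive values. The function names and the sample numbers are hypothetical.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale a series into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def first_differences(values: List[float]) -> List[float]:
    """Approximate the first-order derivative by consecutive differences."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical daily order counts for one website (assumption, not from the source).
daily_orders = [10, 20, 40, 30]
normalized_orders = min_max_normalize(daily_orders)
order_features = first_differences(normalized_orders)
```

Working on differences of normalized series makes the features describe the shape of a website's order or access curve rather than its absolute scale, which is one plausible reason for this step.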
In some embodiments, clustering the plurality of sample feature vectors to obtain a website classification model includes: performing hierarchical clustering on the plurality of sample feature vectors by using a hierarchical clustering method, based on a preset second cluster number and a preset distance parameter, to obtain the website classification model.
In some embodiments, the hierarchical clustering method includes at least one of: shortest distance method, longest distance method, average distance method, centroid distance method.
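As an illustration of the four linkage criteria named above, the following is a minimal pure-Python sketch of agglomerative hierarchical clustering. It is not the patented implementation: the text does not say how the preset cluster number and the preset distance parameter interact, so the sketch assumes that merging stops as soon as the cluster count reaches `n_clusters` or the closest pair of clusters is farther apart than `max_distance`. All names are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]


def _pairwise(a: List[Point], b: List[Point]) -> List[float]:
    """All Euclidean distances between members of cluster a and cluster b."""
    return [math.dist(p, q) for p in a for q in b]


def _centroid(cluster: List[Point]) -> List[float]:
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


# Inter-cluster distance for each method listed in the text.
LINKAGES: Dict[str, Callable[[List[Point], List[Point]], float]] = {
    "shortest": lambda a, b: min(_pairwise(a, b)),
    "longest": lambda a, b: max(_pairwise(a, b)),
    "average": lambda a, b: sum(_pairwise(a, b)) / (len(a) * len(b)),
    "centroid": lambda a, b: math.dist(_centroid(a), _centroid(b)),
}


def hierarchical_cluster(points: List[Point], n_clusters: int,
                         max_distance: float, method: str = "average") -> List[List[Point]]:
    """Merge the closest clusters until a stopping condition is met (assumed rule)."""
    linkage = LINKAGES[method]
    clusters: List[List[Point]] = [[p] for p in points]
    while len(clusters) > max(n_clusters, 1):
        # Find the pair of clusters with the smallest inter-cluster distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical two-dimensional sample feature vectors.
samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
groups = hierarchical_cluster(samples, n_clusters=2, max_distance=5.0, method="shortest")
```

This naive version recomputes every pairwise distance at each merge, which is acceptable for a handful of sample vectors but not for large data sets.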
In a second aspect, an embodiment of the present application provides a website category obtaining apparatus, including: the acquisition unit is configured to acquire an order data set and an access data set of a target website within a first preset time period; the selecting unit is configured to analyze the order data set and the access data set, select order data from the order data set to generate a target order data set, and select access data from the access data set to generate a target access data set; an extraction unit configured to extract feature vectors from the target order data set and the target access data set; and the classification unit is configured to input the feature vectors into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing the corresponding relation between the feature vectors of the website and the secondary category of the website.
In some embodiments, the feature vector comprises at least one of: the order volume of the target website, the visitor volume of the target website, and the browsing volume of the target website.
In some embodiments, the apparatus further comprises: a first query unit configured to query a first correspondence table and acquire the primary category to which the secondary category of the target website belongs, wherein the first correspondence table is used for storing secondary categories and the primary categories to which they belong; a category acquisition unit configured to acquire an initial primary category submitted by the target website during registration; a determining unit configured to determine whether the primary category to which the secondary category of the target website belongs is the same as the initial primary category; and a first output unit configured to output abnormality prompt information if the two are not the same.
In some embodiments, the apparatus further comprises: a second query unit configured to query a second correspondence table and acquire the order peak time period corresponding to the secondary category of the target website, wherein the second correspondence table is used for storing secondary categories and the order peak time periods corresponding to them; and a second output unit configured to output the order peak time period corresponding to the secondary category of the target website.
In some embodiments, the apparatus further comprises a website classification model building unit, the website classification model building unit comprising: an acquisition subunit configured to acquire order data sets and access data sets of a plurality of websites within a second preset time period, respectively; a selection subunit configured to analyze the order data sets and the access data sets of the plurality of websites, select order data from the order data sets of the plurality of websites to generate a plurality of sample order data sets, and select access data from the access data sets of the plurality of websites to generate a plurality of sample access data sets; an extraction subunit configured to extract a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively; and a clustering subunit configured to cluster the plurality of sample feature vectors to obtain the website classification model.
In some embodiments, the selection subunit includes: a deleting module configured to delete order data and access data with missing fields from the order data sets and the access data sets of the plurality of websites, so as to obtain first order data sets and first access data sets of the plurality of websites; a deduplication module configured to perform deduplication on the first order data sets and the first access data sets of the plurality of websites, respectively, to obtain second order data sets and second access data sets of the plurality of websites; and a denoising module configured to denoise the second order data sets and the second access data sets of the plurality of websites based on a preset first cluster number to obtain the plurality of sample order data sets and the plurality of sample access data sets.
In some embodiments, the extraction subunit includes: a normalization module configured to normalize the plurality of sample order data sets and the plurality of sample access data sets, respectively, to obtain a plurality of normalized sample order data sets and a plurality of normalized sample access data sets; and a derivation module configured to generate first-order derivative sets corresponding to the plurality of normalized sample order data sets and first-order derivative sets corresponding to the plurality of normalized sample access data sets, respectively, and to use these first-order derivative sets as the plurality of sample feature vectors.
In some embodiments, the clustering subunit is further configured to: perform hierarchical clustering on the plurality of sample feature vectors by using a hierarchical clustering method, based on a preset second cluster number and a preset distance parameter, to obtain the website classification model.
In some embodiments, the hierarchical clustering method includes at least one of: shortest distance method, longest distance method, average distance method, centroid distance method.
In a third aspect, an embodiment of the present application provides a server, where the server includes: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the method and the device for acquiring the website category provided by the embodiments of the present application, the order data set and the access data set of the target website within the first preset time period are acquired and analyzed to generate the target order data set and the target access data set; a feature vector is then extracted from the target order data set and the target access data set; and finally the feature vector is input into a pre-trained website classification model for classification to obtain the secondary category of the target website. Because websites are classified by the website classification model rather than by manual analysis, website classification efficiency is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a website category acquisition method according to the present application;
FIG. 3 is a flow diagram of one embodiment of a method of building a website classification model according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a website category acquisition device according to the present application;
FIG. 5 is a block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which the website category acquisition method or the website category acquisition apparatus of the present application may be applied.
As shown in FIG. 1, the system architecture 100 may include a terminal device 101, a database server 102, a network 103, and a server 104. The network 103 is the medium used to provide communication links between the terminal device 101, the database server 102, and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use terminal device 101 to interact with server 104 over network 103 to receive or send messages and the like. For example, the user may use the terminal device 101 to send the order data set and the access data set of the target website within the first preset time period to the server 104 through the network 103. The terminal device 101 may be various electronic devices including, but not limited to, a smart phone, a tablet computer, an e-book reader, a laptop portable computer, a desktop computer, and the like.
The database server 102 may also be configured to store the order data set and the access data set of the target website in the first preset time period, so that the server 104 obtains the order data set and the access data set of the target website in the first preset time period from the database server 102 through the network 103.
The server 104 may be a server that provides various services. For example, the server 104 may obtain the order data set and the access data set of the target website in the first preset time period from the terminal device 101 or the database server 102, perform processing such as analysis on the obtained order data set and the obtained access data set of the target website in the first preset time period, and output a processing result (for example, a secondary category of the target website).
It should be noted that the website category acquiring method provided in the embodiment of the present application is generally executed by the server 104, and accordingly, the website category acquiring device is generally disposed in the server 104.
It should be understood that the number of terminal devices, database servers, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, database servers, networks, and servers, as desired for implementation. In the case where the server 104 stores the order data set and the access data set of the target website within the first preset time period, the terminal device 101 and the database server 102 may not be provided in the system architecture 100.
With continued reference to FIG. 2, a flow 200 of one embodiment of a website category acquisition method according to the present application is shown. The website category acquisition method comprises the following steps:
Step 201, acquiring an order data set and an access data set of a target website within a first preset time period.
In this embodiment, the electronic device (for example, the server 104 shown in fig. 1) on which the website category acquiring method operates may acquire the order data set and the access data set of the target website within a first preset time period (for example, within a certain day, within a certain week, within a certain month, and the like). Here, a website is generally a collection of web pages that are created on the internet using a tool such as HTML (Hyper Text Markup Language) according to a certain rule and that are used to display specific content. For example, a website may be an online store on an e-commerce platform, and a target website may be an online store on an e-commerce platform.
In this embodiment, the order data set may be a data set related to an order of a user in a target website. Each piece of order data may include, but is not limited to: information on the target website (e.g., name of the target website, contact phone of the target website, address of the target website, etc.), information on the ordering user (e.g., account name of the ordering user, contact phone of the ordering user, address of the ordering user, etc.), information on the ordering item (e.g., name of the ordering item, SKU (Stock Keeping Unit) number of the ordering item, category of the ordering item, price of the ordering item, etc.), and the like. The access data set may be a data set related to access by the user in the target website. Wherein each piece of access data may include, but is not limited to: information of the target website (e.g., name of the target website, contact phone of the target website, address of the target website, etc.), information of the accessing user (e.g., account name of the accessing user, contact phone of the accessing user, address of the accessing user, etc.), information of the accessing item (e.g., name of the accessing item, SKU number of the accessing item, category of the accessing item, price of the accessing item, etc.), and the like.
It should be noted that the electronic device may obtain the order data set and the access data set of the target website within the first preset time period locally, from a terminal device (e.g., the terminal device 101 shown in FIG. 1) communicatively connected to it, or from a database server (e.g., the database server 102 shown in FIG. 1) communicatively connected to it. This embodiment does not limit the source from which the electronic device obtains the order data set and the access data set of the target website within the first preset time period.
Step 202, analyzing the order data set and the access data set, selecting order data from the order data set to generate a target order data set, and selecting access data from the access data set to generate a target access data set.
In this embodiment, based on the order data set and the access data set obtained in step 201, the electronic device may analyze the order data set and the access data set, obtain a target order data set from the order data set, and obtain a target access data set from the access data set.
In this embodiment, the electronic device may obtain the target order data set and the target access data set in various ways.
In some optional implementations of this embodiment, the electronic device may randomly select a plurality of pieces of order data from the order data set to generate the target order data set, and may randomly select a plurality of pieces of access data from the access data set to generate the target access data set.
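A minimal sketch of the random selection, assuming each data set is held in memory as a list of records and that a fixed sample size is wanted. The sample size, the optional seed, and the function name are assumptions made for illustration; the text does not specify how many records are selected.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def sample_records(records: List[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick up to k distinct records uniformly at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(records, min(k, len(records)))


# Hypothetical order identifiers standing in for full order records.
order_data_set = [f"order-{n}" for n in range(10)]
target_order_data_set = sample_records(order_data_set, k=3, seed=0)
```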
In some optional implementations of this embodiment, the electronic device may first delete the order data and the access data with missing fields from the order data set and the access data set, and then perform deduplication on the order data set and the access data set, respectively, to obtain the target order data set and the target access data set.
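The two cleaning steps can be sketched as follows, assuming each record is a dictionary and that a record counts as a duplicate when it matches an earlier record on every required field. The field names (`order_id`, `sku`, `price`) and the rule that `None` or an empty string means "missing" are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def clean_records(records: List[Record], required_fields: Sequence[str]) -> List[Record]:
    """Drop records with missing required fields, then drop duplicates."""
    seen = set()
    cleaned: List[Record] = []
    for record in records:
        # An absent key, None, or an empty string is treated as a missing field.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Records that agree on every required field are considered duplicates.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical order records.
orders = [
    {"order_id": "A1", "sku": "1001", "price": "19.9"},
    {"order_id": "A1", "sku": "1001", "price": "19.9"},  # exact duplicate
    {"order_id": "A2", "sku": None, "price": "5.0"},     # missing field
    {"order_id": "A3", "sku": "1002", "price": "7.5"},
]
target_orders = clean_records(orders, ["order_id", "sku", "price"])
```

The same function can be applied to the access data set with its own list of required fields.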
Step 203, extracting feature vectors from the target order data set and the target access data set.
In this embodiment, based on the target order data set and the target access data set generated in step 202, the electronic device may extract a feature vector from the target order data set and the target access data set. As an example, the electronic device may perform statistical analysis on the target order data set to obtain the order volume of the target website, and may perform statistical analysis on the target access data set to obtain the visitor volume of the target website. The electronic device may then use the order volume and the visitor volume of the target website as the feature vector, or may normalize the order volume and the visitor volume of the target website and use the normalized order volume and the normalized visitor volume as the feature vector.
In some optional implementations of this embodiment, the feature vector may include, but is not limited to, at least one of: the order volume of the target website, the visitor volume of the target website, and the browsing volume of the target website.
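A minimal sketch of this feature extraction, under stated assumptions: the order volume is taken as the number of order records, the visitor volume as the number of distinct visiting accounts, and the browsing volume as the number of access records. The text does not define these quantities precisely or name a normalization scheme, so the `account` field, the counting rules, and the divide-by-maximum normalization are all hypothetical.

```python
from typing import Dict, List


def extract_feature_vector(orders: List[Dict[str, str]],
                           visits: List[Dict[str, str]]) -> List[float]:
    """Return [order volume, visitor volume, browsing volume]."""
    order_volume = len(orders)                            # one record per order (assumption)
    visitor_volume = len({v["account"] for v in visits})  # distinct visiting accounts
    browsing_volume = len(visits)                         # one record per page view (assumption)
    return [float(order_volume), float(visitor_volume), float(browsing_volume)]


def normalize(vector: List[float]) -> List[float]:
    """Divide every component by the largest one; an all-zero vector is returned as-is."""
    peak = max(vector)
    return [x / peak for x in vector] if peak else list(vector)


# Hypothetical cleaned data sets for one target website.
target_orders = [{"order_id": "A1"}, {"order_id": "A3"}]
target_visits = [{"account": "u1"}, {"account": "u2"}, {"account": "u1"}]
feature_vector = extract_feature_vector(target_orders, target_visits)
normalized_vector = normalize(feature_vector)
```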
Step 204, inputting the feature vector into a pre-trained website classification model for classification to obtain a secondary category of the target website.
In this embodiment, based on the feature vectors extracted in step 203, the electronic device may input the feature vectors into a website classification model trained in advance for classification, so as to obtain a secondary category of the target website. Wherein the secondary category may be a business model category of the website. For example, secondary categories may include, but are not limited to: wholesale retail mode category, brick and mortar store online mode category, distribution mode category, and shopping mode category.
In this embodiment, the website classification model may be used to characterize a correspondence between a feature vector of a website and a secondary category of the website. Here, the electronic device may establish the website classification model in various ways. For example, the electronic device may generate a correspondence table storing correspondences between a plurality of feature vectors and secondary categories of websites based on statistics of secondary categories of websites and feature vectors for a large number of websites, and use the correspondence table as a website classification model.
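The correspondence-table form of the model can be sketched as a list of (feature vector, secondary category) pairs. The text does not say how a feature vector that is absent from the table is matched, so the sketch assumes a nearest-entry lookup by Euclidean distance. The table entries and category labels below are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Each entry pairs a reference feature vector with a secondary category.
CorrespondenceTable = List[Tuple[Sequence[float], str]]


def classify(feature_vector: Sequence[float], table: CorrespondenceTable) -> str:
    """Return the secondary category of the closest table entry (assumed rule)."""
    if not table:
        raise ValueError("the correspondence table is empty")
    _, category = min(table, key=lambda entry: math.dist(entry[0], feature_vector))
    return category


# Hypothetical table built from statistics over many websites.
model_table: CorrespondenceTable = [
    ((0.9, 0.2), "wholesale retail mode"),
    ((0.1, 0.8), "distribution mode"),
]
secondary_category = classify((0.8, 0.3), model_table)
```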
In some optional implementations of this embodiment, after obtaining the secondary category of the target website, the electronic device may first query the first correspondence table to obtain the primary category to which the secondary category of the target website belongs; then acquire the initial primary category submitted by the target website during registration; then determine whether the primary category to which the secondary category of the target website belongs is the same as the initial primary category; and finally output abnormality prompt information if the two differ. The first correspondence table may be used to store secondary categories and the primary categories to which they belong. The primary category may be the type of the website, and websites may be divided into various types according to the goods they sell, for example, electronic product websites, book websites, food websites, medicine websites, clothing websites, and the like. As an example, if the primary category to which the secondary category of the target website belongs is the medicine type, and the initial primary category submitted by the target website during registration is the clothing type, the electronic device may output abnormality prompt information indicating that the target website may have been registered under a false category.
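The registration check can be sketched with a dictionary standing in for the first correspondence table. The secondary category labels (`mode_a`, `mode_b`), the table contents, and the wording of the prompt are hypothetical; only the lookup-then-compare logic follows the text.

```python
from typing import Dict, Optional


def check_registration(secondary_category: str, registered_primary: str,
                       first_table: Dict[str, str]) -> Optional[str]:
    """Return abnormality prompt text if the inferred and registered primary categories differ."""
    # A secondary category missing from the table raises KeyError in this sketch.
    inferred_primary = first_table[secondary_category]
    if inferred_primary != registered_primary:
        return (f"Possible false registration: registered as '{registered_primary}', "
                f"but classified under '{inferred_primary}'.")
    return None


# Hypothetical first correspondence table: secondary category -> primary category.
first_correspondence_table = {"mode_a": "medicine", "mode_b": "clothing"}
prompt = check_registration("mode_a", "clothing", first_correspondence_table)
```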
In some optional implementations of this embodiment, after obtaining the secondary category of the target website, the electronic device may first query the second correspondence table to obtain the order peak time period corresponding to the secondary category of the target website, and then output that order peak time period. The second correspondence table may be used to store secondary categories and the order peak time periods corresponding to them. Here, for each secondary category, a person skilled in the art may perform statistical analysis on the order placement times of a large number of websites to obtain the order peak time period corresponding to that secondary category.
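One way the second correspondence table might be derived and queried is sketched below. The text leaves the statistical analysis to a person skilled in the art, so the sketch assumes the simplest rule: for each secondary category, the hour of day with the most orders is the peak, reported as a one-hour window. The category labels and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_peak_table(order_hours: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each secondary category to its busiest hour as (start_hour, end_hour)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for category, hour in order_hours:
        counts[category][hour] += 1
    table = {}
    for category, hour_counts in counts.items():
        # most_common(1) returns the hour with the highest order count.
        peak_hour, _ = hour_counts.most_common(1)[0]
        table[category] = (peak_hour, peak_hour + 1)
    return table


# Hypothetical (secondary category, hour of order placement) observations.
observations = [("mode_a", 20), ("mode_a", 20), ("mode_a", 9), ("mode_b", 10)]
second_correspondence_table = build_peak_table(observations)
peak_period = second_correspondence_table["mode_a"]
```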
According to the website category acquisition method provided by this embodiment of the present application, the order data set and the access data set of the target website within the first preset time period are acquired and analyzed to generate the target order data set and the target access data set; a feature vector is then extracted from the target order data set and the target access data set; and finally the feature vector is input into a pre-trained website classification model for classification to obtain the secondary category of the target website. Because websites are classified by the website classification model rather than by manual analysis, website classification efficiency is improved.
With further reference to FIG. 3, a flow 300 of one embodiment of a method for establishing a website classification model is shown. The flow 300 of the method for establishing a website classification model includes the following steps:
Step 301, acquiring an order data set and an access data set of each of a plurality of websites in a second preset time period.
In this embodiment, an electronic device (e.g., the server 104 shown in fig. 1) may obtain an order data set and an access data set of a plurality of websites in a second preset time period (e.g., within a certain day, a certain week, a certain month, etc.), respectively. The website may be an online store on an e-commerce platform.
Step 302, analyzing the order data sets and the access data sets of the multiple websites, selecting order data from the order data sets of the multiple websites to generate multiple sample order data sets, and selecting access data from the access data sets of the multiple websites to generate multiple sample access data sets.
In this embodiment, based on the order data sets and the access data sets of the multiple websites acquired in step 301, the electronic device may analyze the order data sets and the access data sets of the multiple websites, select order data from the order data sets of the multiple websites to generate multiple sample order data sets, and select access data from the access data sets of the multiple websites to generate multiple sample access data sets.
In this embodiment, the electronic device may obtain a plurality of sample order data sets and sample access data sets in various ways.
In some optional implementations of this embodiment, for each website in the multiple websites, the electronic device may randomly select a number of order data from the order data set of the website to generate a sample order data set of the website; the electronic device may randomly select a number of access data from the access data set of the website to generate a sample access data set of the website.
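The random selection described above can be sketched as follows. The function name sample_records, the fixed sample size k, and the optional seed are assumptions made for illustration; the application does not specify how many records are selected.

```python
import random


def sample_records(records, k, seed=None):
    """Randomly select up to k records (without replacement) from a data set."""
    rng = random.Random(seed)      # a local generator keeps the sketch reproducible
    k = min(k, len(records))       # never request more records than exist
    return rng.sample(records, k)


# Hypothetical order data set of one website.
order_data_set = [{"order_id": n} for n in range(10)]
sample_order_data_set = sample_records(order_data_set, 3, seed=0)
print(len(sample_order_data_set))
```

The same function can be applied to the access data set of the website to generate its sample access data set.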
In some optional implementations of the present embodiment, the electronic device may obtain the plurality of sample order data sets and the sample access data set by the following steps.
First, the electronic device may delete the order data and the access data of the plurality of websites with missing fields to obtain a first order data set and a first access data set of the plurality of websites. Specifically, for each piece of order data or each piece of access data of each website, the electronic device may determine whether a field in the piece of order data or the piece of access data is complete, and if not, delete the piece of order data or the piece of access data.
Then, the electronic device may perform deduplication processing on the first order data set and the first access data set of the multiple websites respectively to obtain a second order data set and a second access data set of the multiple websites. Specifically, for the first order data set or the first access data set of each website, the electronic device may perform deduplication processing on the first order data set or the first access data set of the website to remove duplicate first order data in the first order data set or duplicate first access data in the first access data set of the website.
Finally, the electronic device may denoise the second order data sets and the second access data sets of the multiple websites based on a preset first cluster number (for example, the first cluster number takes a value between 12 and 17), so as to obtain multiple sample order data sets and multiple sample access data sets. Specifically, the electronic device may perform hierarchical clustering on the second order data sets and the second access data sets of the multiple websites by using a hierarchical clustering method based on the first clustering number to remove the second order data sets and the second access data sets of the websites where false registration may exist, and use the second order data sets and the second access data sets of the remaining websites as multiple sample order data sets and multiple sample access data sets.
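The first two cleaning steps above (deleting records with missing fields, then deduplication) can be sketched as follows. The field names in REQUIRED_ORDER_FIELDS and the rule that a field counts as missing when it is absent, None, or an empty string are assumptions of this sketch. The clustering-based denoising step is omitted here because the application does not state the rule for deciding which clusters correspond to falsely registered websites.

```python
# Hypothetical required fields of one order record.
REQUIRED_ORDER_FIELDS = ("order_id", "site_id", "timestamp", "amount")


def drop_incomplete(records, required_fields):
    """Keep only records in which every required field is present and non-empty."""
    return [
        r for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    ]


def deduplicate(records):
    """Remove exact duplicate records while preserving the original order."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(sorted(r.items()))   # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


raw = [
    {"order_id": 1, "site_id": "a", "timestamp": 7, "amount": 10.0},
    {"order_id": 1, "site_id": "a", "timestamp": 7, "amount": 10.0},     # duplicate
    {"order_id": 2, "site_id": "a", "timestamp": None, "amount": 5.0},   # missing field
]
first_order_data_set = drop_incomplete(raw, REQUIRED_ORDER_FIELDS)
second_order_data_set = deduplicate(first_order_data_set)
print(len(first_order_data_set), len(second_order_data_set))
```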
Step 303, extracting a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively.
In this embodiment, based on the plurality of sample order data sets and the plurality of sample access data sets generated in step 302, the electronic device may extract a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively. Wherein the sample feature vector may include, but is not limited to, at least one of: the order amount of the website, the visitor volume of the website and the browsing volume of the website. As an example, for each sample order data set or each sample access data set, the electronic device may perform statistical analysis on the sample order data set, so as to obtain an order quantity corresponding to the sample order data set; the electronic equipment can also perform statistical analysis on the sample access data set so as to obtain the visitor volume corresponding to the sample access data set. At this time, the electronic device may use the order volume corresponding to the sample order data set and the visitor volume corresponding to the sample visit data set as sample feature vectors.
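A minimal sketch of the statistical feature extraction described above is given below. It assumes that each order record counts as one order, that each access record is one page view, and that access records carry a hypothetical visitor_id field; none of these record layouts are specified in the application.

```python
def extract_feature_vector(sample_orders, sample_visits):
    """Return [order quantity, visitor volume, browsing volume] for one website."""
    order_quantity = len(sample_orders)                             # one record = one order
    visitor_volume = len({v["visitor_id"] for v in sample_visits})  # distinct visitors
    browsing_volume = len(sample_visits)                            # one record = one page view
    return [order_quantity, visitor_volume, browsing_volume]


orders = [{"order_id": 1}, {"order_id": 2}]
visits = [{"visitor_id": "u1"}, {"visitor_id": "u1"}, {"visitor_id": "u2"}]
print(extract_feature_vector(orders, visits))
```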
In some optional implementations of the present embodiment, the electronic device may extract the plurality of sample feature vectors by the following steps.
First, the electronic device may perform normalization processing on the plurality of sample order data sets and the plurality of sample access data sets, respectively, to obtain a plurality of normalized sample order data sets and a plurality of normalized sample access data sets. Here, the electronic device may perform the normalization process on the plurality of sample order data sets and the plurality of sample access data sets using a min-max normalization method. Specifically, the electronic device may first set a minimum value (min) and a maximum value (max); the original value x is then mapped to a value x* in the interval [min, max] by the following min-max normalization formula:
Figure BDA0001297950620000121
As an example, the order quantity of a website at times 7 to 12 of a certain day is shown in Table 1 below:
Figure BDA0001297950620000122
TABLE 1
If the order quantity at each time in Table 1 is mapped to a normalized value in the interval [0,1] by the min-max normalization formula, the normalized order quantities of the website at times 7 to 12 of that day are as shown in Table 2 below:
Figure BDA0001297950620000131
TABLE 2
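The normalization step can be sketched as follows. Because the formula and the two tables are only available as figures, this sketch assumes the standard min-max rescaling, in which the smallest observed value maps to the lower bound of the target interval and the largest to the upper bound. The order quantities used are hypothetical and are not the values of Table 1.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal: map everything to the lower bound (an assumption).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical order quantities at times 7 to 12 of one day.
order_quantities = [12, 30, 45, 60, 38, 20]
normalized = min_max_normalize(order_quantities)
print([round(x, 3) for x in normalized])
```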
Then, the electronic device may generate a set of first order derivatives corresponding to the plurality of normalized sample order data sets and a set of first order derivatives corresponding to the plurality of normalized sample access data sets, respectively, as a plurality of sample feature vectors. Following the above example, the electronic device may use the following formula to obtain the first derivative f'(x*_i) corresponding to the normalized order quantity at each time:
Figure BDA0001297950620000132
Where i is a positive integer with 7 ≤ i ≤ 12, x* is the normalized order quantity, x*_i is the normalized order quantity at the i-th time, f'(x*) is the first derivative corresponding to the normalized order quantity, and f'(x*_i) is the first derivative corresponding to the normalized order quantity at the i-th time.
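For discrete hourly data, a first derivative is commonly approximated by a finite difference. The sketch below uses a forward difference between consecutive times. This choice is an assumption, since the formula of the application is only available as a figure and may differ, and the function name first_differences is hypothetical.

```python
def first_differences(normalized, times):
    """Approximate the first derivative by forward differences.

    Returns one value per pair of consecutive times, so the result is one
    element shorter than the input.
    """
    return [
        (normalized[k + 1] - normalized[k]) / (times[k + 1] - times[k])
        for k in range(len(normalized) - 1)
    ]


# Hypothetical normalized order quantities at times 7 to 12.
times = [7, 8, 9, 10, 11, 12]
normalized = [0.0, 0.25, 0.5, 1.0, 0.75, 0.25]
feature_vector = first_differences(normalized, times)
print(feature_vector)
```

The resulting list describes how the order quantity rises and falls over the day rather than its absolute level, which is what allows websites of different sizes to be compared.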
And step 304, clustering the plurality of sample feature vectors to obtain a website classification model.
In this embodiment, based on the plurality of sample feature vectors extracted in step 303, the electronic device may cluster the plurality of sample feature vectors, so as to establish a trained website classification model that represents the accurate correspondence between the feature vectors of websites and the secondary categories of websites. Here, clustering generally refers to the process of dividing a collection of physical or abstract objects into classes composed of similar objects. A class generated by clustering is a set of data objects that are similar to the other objects in the same class and dissimilar to objects in other classes. Here, clustering the plurality of sample feature vectors may generate a plurality of classes, each class corresponding to a secondary category.
In some optional implementations of this embodiment, the electronic device may perform hierarchical clustering on the plurality of sample feature vectors by using a hierarchical clustering method based on a preset second cluster number (the second cluster number is generally smaller than the first cluster number, for example, the second cluster number takes a value between 2 and 5) and a preset distance parameter, so as to obtain a website classification model. Hierarchical clustering is a major clustering method, in which clustering is completed by generating a series of nested cluster trees. The single-point clusters are at the bottom of the tree, and a root node cluster is at the top of the tree. The root node cluster covers all data points. Hierarchical clustering can be divided into agglomerative (bottom-up) clustering and divisive (top-down) clustering; agglomerative clustering is employed here. The distance parameter may include a distance value between two classes and a distance value between two objects of the same class. Here, the distance indicated by the distance parameter may be a Euclidean distance or a Manhattan distance. The termination condition of the hierarchical clustering is that the distance between two classes and the distance between two objects in the same class reach the distance indicated by the distance parameter, or that the number of classes reaches the second cluster number.
In some optional implementations of this embodiment, the hierarchical clustering method may include, but is not limited to, at least one of: the shortest distance method (single linkage, SL method), the longest distance method (complete linkage, CL method), the average distance method (average linkage, AL method), and the centroid distance method. In the shortest distance method, the inter-class distance is equal to the minimum distance between objects of the two classes. In the longest distance method, the inter-class distance is equal to the maximum distance between objects of the two classes. In the average distance method, the inter-class distance is equal to the average distance between objects of the two classes. In the centroid distance method, the inter-class distance is equal to the distance between the centroids of the two classes.
According to the method for establishing the website classification model provided by this embodiment, order data sets and access data sets of a plurality of websites in a second preset time period are obtained, so that the order data sets and the access data sets of the websites can be analyzed and a plurality of sample order data sets and sample access data sets can be generated; then, a plurality of sample feature vectors are extracted from the plurality of sample order data sets and the plurality of sample access data sets respectively; and finally, the plurality of sample feature vectors are clustered to obtain a website classification model. Thus, a website classification model representing the accurate correspondence between the feature vectors of websites and the secondary categories of websites can be established quickly.
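The four inter-class distances and the two termination conditions described above can be illustrated with a small agglomerative clustering sketch in plain Python. The function names, the default choice of average linkage, and the simplification of the distance parameter to a single inter-class threshold (max_distance) are assumptions of this sketch, not details taken from the application.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def cluster_distance(c1, c2, linkage, dist):
    """Inter-class distance under the chosen linkage method."""
    if linkage == "centroid":
        return dist(centroid(c1), centroid(c2))
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":        # shortest distance method
        return min(pairwise)
    if linkage == "complete":      # longest distance method
        return max(pairwise)
    if linkage == "average":       # average distance method
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, max_distance=math.inf,
                  linkage="average", dist=euclidean):
    """Merge the two closest clusters until n_clusters remain or the closest
    pair is farther apart than max_distance."""
    clusters = [[p] for p in points]          # start from single-point clusters
    while len(clusters) > n_clusters:
        (i, j), d = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], linkage, dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                        # j > i, so index i stays valid
    return clusters


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(agglomerative(vectors, n_clusters=2, linkage="single"))
```

This naive version recomputes all inter-class distances on every merge, which is adequate for illustration but scales poorly; a library implementation would normally be preferred for large numbers of websites.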
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a website category obtaining apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 4, the website category acquiring apparatus 400 of the present embodiment may include: an acquisition unit 401, a selection unit 402, an extraction unit 403, and a classification unit 404. The acquiring unit 401 is configured to acquire an order data set and an access data set of a target website within a first preset time period; a selecting unit 402, configured to analyze the order data set and the access data set, select order data from the order data set to generate a target order data set, and select access data from the access data set to generate a target access data set; an extracting unit 403 configured to extract feature vectors from the target order data set and the target access data set; the classification unit 404 is configured to input the feature vector to a pre-trained website classification model for classification, so as to obtain a secondary category of the target website, where the website classification model is used to represent a correspondence between the feature vector of the website and the secondary category of the website.
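The cooperation of the four units can be sketched as a simple composition of callables. The class name, the call signatures, and the stand-in functions below are hypothetical and serve only to show the order in which the units hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WebsiteCategoryApparatus:
    acquire: Callable    # stands in for acquisition unit 401
    select: Callable     # stands in for selection unit 402
    extract: Callable    # stands in for extraction unit 403
    classify: Callable   # stands in for classification unit 404

    def run(self, site_id, period):
        orders, visits = self.acquire(site_id, period)
        target_orders, target_visits = self.select(orders, visits)
        feature_vector = self.extract(target_orders, target_visits)
        return self.classify(feature_vector)


# Trivial stand-in units, used only to exercise the wiring.
apparatus = WebsiteCategoryApparatus(
    acquire=lambda site_id, period: ([1, 2, 3], ["u1", "u2"]),
    select=lambda orders, visits: (orders, visits),
    extract=lambda orders, visits: [len(orders), len(visits)],
    classify=lambda fv: "category-A" if fv[0] > fv[1] else "category-B",
)
print(apparatus.run("site-42", "one day"))
```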
In the present embodiment, in the website category acquiring apparatus 400: the specific processing of the obtaining unit 401, the selecting unit 402, the extracting unit 403, and the classifying unit 404 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the feature vector includes at least one of: the order amount of the target website, the visitor volume of the target website and the browsing volume of the target website.
In some optional implementations of this embodiment, the website category obtaining apparatus 400 may further include: a first query unit (not shown in the figure), configured to query a first correspondence table and obtain the primary category to which the secondary category of the target website belongs, where the first correspondence table is used to store each secondary category and the primary category to which it belongs; a category acquisition unit (not shown in the figure), configured to acquire an initial primary category submitted by the target website at the time of registration; a determining unit, configured to determine whether the primary category to which the secondary category of the target website belongs is the same as the initial primary category; and a first output unit (not shown in the figure), configured to output abnormal prompt information if the two are different.
In some optional implementations of this embodiment, the website category obtaining apparatus 400 may further include: a second query unit (not shown in the figure), configured to query a second correspondence table, and obtain an order peak time period corresponding to a secondary category of the target website, where the second correspondence table is used to store the secondary category and the order peak time period corresponding to the secondary category; and a second output unit (not shown in the figure) configured to output the order taking peak time period corresponding to the secondary category of the target website.
In some optional implementations of this embodiment, the website category obtaining apparatus 400 may further include a website classification model building unit (not shown in the figure), and the website classification model building unit may include: an obtaining subunit (not shown in the figure), configured to obtain an order data set and an access data set of each of a plurality of websites in a second preset time period; a selecting subunit (not shown in the figure), configured to analyze the order data sets and the access data sets of the multiple websites, select order data from the order data sets of the multiple websites to generate multiple sample order data sets, and select access data from the access data sets of the multiple websites to generate multiple sample access data sets; an extraction subunit (not shown in the figure), configured to extract a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively; and a clustering subunit (not shown in the figure), configured to cluster the plurality of sample feature vectors to obtain a website classification model.
In some optional implementations of this embodiment, the selecting subunit may include: a deleting module (not shown in the figure), configured to delete the order data and the access data with missing fields from the order data sets and the access data sets of the plurality of websites, so as to obtain a first order data set and a first access data set of the plurality of websites; a deduplication module (not shown in the figure), configured to perform deduplication processing on the first order data set and the first access data set of the multiple websites respectively to obtain a second order data set and a second access data set of the multiple websites; and a denoising module (not shown in the figure), configured to denoise the second order data sets and the second access data sets of the multiple websites based on a preset first cluster number to obtain multiple sample order data sets and multiple sample access data sets.
In some optional implementations of this embodiment, the extracting subunit may include: a normalization module (not shown in the figure), configured to perform normalization processing on the plurality of sample order data sets and the plurality of sample access data sets, respectively, to obtain a plurality of normalized sample order data sets and a plurality of normalized sample access data sets; a derivation module (not shown in the figure) configured to generate a set of first derivatives corresponding to the plurality of normalized sample order data sets and a set of first derivatives corresponding to the plurality of normalized sample access data sets, respectively, as a plurality of sample feature vectors.
In some optional implementations of this embodiment, the clustering subunit is further configured to: and based on the preset second clustering number and the preset distance parameter, carrying out hierarchical clustering on the plurality of sample characteristic vectors by using a hierarchical clustering method to obtain a website classification model.
In some optional implementations of this embodiment, the hierarchical clustering method includes at least one of: shortest distance method, longest distance method, average distance method, centroid distance method.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card, a modem, or the like. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage portion 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a selection unit, an extraction unit, and a classification unit. The names of the units do not form a limitation to the units themselves in some cases, for example, the acquiring unit may also be described as a unit for acquiring the order data set and the access data set of the target website in the first preset time period.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the server described in the above embodiments; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquiring an order data set and an access data set of a target website in a first preset time period; analyzing the order data set and the access data set, selecting order data from the order data set to generate a target order data set, and selecting access data from the access data set to generate a target access data set; extracting a characteristic vector from the target order data set and the target access data set; and inputting the feature vectors into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing the corresponding relation between the feature vectors of the website and the secondary category of the website.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A website category acquisition method is characterized by comprising the following steps:
acquiring an order data set and an access data set of a target website in a first preset time period;
analyzing the order data set and the access data set, selecting order data from the order data set to generate a target order data set, and selecting access data from the access data set to generate a target access data set;
extracting feature vectors from the target order data set and the target access data set;
inputting the feature vectors into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing the corresponding relation between the feature vectors of the website and the secondary category of the website;
inquiring a second corresponding relation table to obtain the order taking peak time period corresponding to the secondary category of the target website, wherein the second corresponding relation table is used for storing the secondary category and the order taking peak time period corresponding to the secondary category;
and outputting the order taking peak time period corresponding to the secondary category of the target website.
2. The method of claim 1, wherein the feature vector comprises at least one of: the order amount of the target website, the visitor volume of the target website and the browsing volume of the target website.
3. The method of claim 1, wherein after the inputting the feature vector into a pre-trained website classification model for classification to obtain a secondary category of the target website, further comprising:
inquiring a first corresponding relation table to obtain the primary category to which the secondary category of the target website belongs, wherein the first corresponding relation table is used for storing the secondary category and the primary category to which the secondary category belongs;
acquiring an initial primary category submitted by the target website during registration;
determining whether the primary category to which the secondary category of the target website belongs is the same as the initial primary category;
if not, outputting abnormal prompt information.
4. The method according to any one of claims 1 to 3, wherein the method further comprises the step of establishing a website classification model, the step of establishing a website classification model comprising:
respectively acquiring order data sets and access data sets of a plurality of websites in a second preset time period;
analyzing the order data sets and the access data sets of the websites, selecting order data from the order data sets of the websites to generate a plurality of sample order data sets, and selecting access data from the access data sets of the websites to generate a plurality of sample access data sets;
extracting a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets respectively;
and clustering the sample feature vectors to obtain a website classification model.
5. The method of claim 4, wherein analyzing the order data sets and the access data sets of the plurality of websites, selecting order data from the order data sets of the plurality of websites to generate a plurality of sample order data sets, and selecting access data from the access data sets of the plurality of websites to generate a plurality of sample access data sets comprises:
deleting the order data and the access data with missing fields in the order data sets and the access data sets of the multiple websites to obtain a first order data set and a first access data set of the multiple websites;
respectively carrying out duplicate removal processing on the first order data sets and the first access data sets of the multiple websites to obtain second order data sets and second access data sets of the multiple websites;
denoising the second order data sets and the second access data sets of the multiple websites based on a preset first cluster number to obtain multiple sample order data sets and multiple sample access data sets.
6. The method of claim 4, wherein extracting a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively, comprises:
respectively carrying out normalization processing on the plurality of sample order data sets and the plurality of sample access data sets to obtain a plurality of normalized sample order data sets and a plurality of normalized sample access data sets;
generating a set of first order derivatives corresponding to the plurality of normalized sample order data sets and a set of first order derivatives corresponding to the plurality of normalized sample access data sets, respectively, as a plurality of sample feature vectors.
7. The method of claim 4, wherein clustering the plurality of sample feature vectors to obtain a website classification model comprises:
and based on the preset second clustering number and the preset distance parameter, carrying out hierarchical clustering on the plurality of sample characteristic vectors by using a hierarchical clustering method to obtain a website classification model.
8. The method of claim 7, wherein the hierarchical clustering method comprises at least one of: shortest distance method, longest distance method, average distance method, centroid distance method.
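The hierarchical clustering of claims 7 and 8 could be sketched as an agglomerative procedure that supports the four listed inter-cluster distances. This is a minimal sketch under assumptions: Euclidean distance between vectors is assumed, and the "preset distance parameter" is interpreted here as a threshold above which clusters are no longer merged, which the claim does not confirm. The names `hierarchical_cluster`, `n_clusters`, and `max_distance` are hypothetical.

```python
from itertools import combinations
from math import dist


def _centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def cluster_distance(a, b, method):
    """Distance between two clusters under one of the four listed methods."""
    if method == "centroid":
        return dist(_centroid(a), _centroid(b))
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "shortest":
        return min(pairwise)
    if method == "longest":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


def hierarchical_cluster(vectors, n_clusters, method="average",
                         max_distance=float("inf")):
    """Merge the closest pair of clusters until n_clusters remain or the
    closest pair is farther apart than max_distance (assumed meaning of
    the preset distance parameter)."""
    clusters = [[tuple(v)] for v in vectors]
    while len(clusters) > max(n_clusters, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        if cluster_distance(clusters[i], clusters[j], method) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

The sketch recomputes every pairwise distance on each pass, which is acceptable for a small number of websites; a production system would more likely use a library routine with a cached distance matrix.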
9. A website category acquisition apparatus, comprising:
an acquisition unit configured to acquire an order data set and an access data set of a target website within a first preset time period;
a selecting unit configured to analyze the order data set and the access data set, select order data from the order data set to generate a target order data set, and select access data from the access data set to generate a target access data set;
an extraction unit configured to extract feature vectors from the target order data set and the target access data set;
a classification unit configured to input the feature vectors into a pre-trained website classification model for classification to obtain a secondary category of the target website, wherein the website classification model is used for representing a corresponding relation between the feature vectors of a website and the secondary category of the website;
a second query unit configured to query a second corresponding relation table and acquire an order peak time period corresponding to the secondary category of the target website, wherein the second corresponding relation table is used for storing the secondary category and the order peak time period corresponding to the secondary category;
and a second output unit configured to output the order peak time period corresponding to the secondary category of the target website.
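The classification and query units of claim 9 could be sketched as follows. The sketch assumes that the trained model is stored as one centroid per secondary category and that classification picks the nearest centroid; the claim only requires that the model map feature vectors to secondary categories. The category names, centroids, and time periods below are invented placeholders.

```python
from math import dist

# Hypothetical model: one centroid per secondary category.
MODEL = {
    "category_a": (0.5, -0.5),
    "category_b": (-0.2, 0.3),
}

# Hypothetical second corresponding relation table:
# secondary category -> order peak time period.
PEAK_PERIODS = {
    "category_a": "07:00-09:00",
    "category_b": "20:00-22:00",
}


def classify(feature_vector, model):
    """Return the category whose centroid is nearest to the vector."""
    return min(model, key=lambda c: dist(feature_vector, model[c]))


def peak_period_for(feature_vector, model, table):
    """Classify the website, then look up its order peak time period.
    Returns (category, period); period is None if the table has no entry."""
    category = classify(feature_vector, model)
    return category, table.get(category)
```

A call such as `peak_period_for(vector, MODEL, PEAK_PERIODS)` covers the classification unit, the second query unit, and the second output unit in one step.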
10. The apparatus of claim 9, further comprising a website classification model building unit, wherein the website classification model building unit comprises:
an acquisition subunit configured to acquire order data sets and access data sets of a plurality of websites within a second preset time period, respectively;
a selecting subunit configured to analyze the order data sets and the access data sets of the plurality of websites, select order data from the order data sets of the plurality of websites to generate a plurality of sample order data sets, and select access data from the access data sets of the plurality of websites to generate a plurality of sample access data sets;
an extraction subunit configured to extract a plurality of sample feature vectors from the plurality of sample order data sets and the plurality of sample access data sets, respectively;
and a clustering subunit configured to cluster the plurality of sample feature vectors to obtain the website classification model.
11. A server, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710351636.4A CN108959289B (en) 2017-05-18 2017-05-18 Website category acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710351636.4A CN108959289B (en) 2017-05-18 2017-05-18 Website category acquisition method and device

Publications (2)

Publication Number Publication Date
CN108959289A CN108959289A (en) 2018-12-07
CN108959289B true CN108959289B (en) 2022-04-26

Family

ID=64462802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710351636.4A Active CN108959289B (en) 2017-05-18 2017-05-18 Website category acquisition method and device

Country Status (1)

Country Link
CN (1) CN108959289B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882265A (en) * 2020-06-29 2020-11-03 深圳市法本信息技术股份有限公司 Cross-border e-commerce automatic customs declaration method and automatic customs declaration robot
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering
CN114615262A (en) * 2022-01-30 2022-06-10 阿里巴巴(中国)有限公司 Network aggregation method, storage medium, processor and system

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN103324628B (en) * 2012-03-21 2016-06-08 腾讯科技(深圳)有限公司 A kind of trade classification method and system for issuing text
US9262646B1 (en) * 2013-05-31 2016-02-16 Symantec Corporation Systems and methods for managing web browser histories
JP6344395B2 (en) * 2013-09-20 2018-06-20 日本電気株式会社 Payout amount prediction device, payout amount prediction method, program, and payout amount prediction system
CN103605794B (en) * 2013-12-05 2017-02-15 国家计算机网络与信息安全管理中心 Website classifying method
CN103744981B (en) * 2014-01-14 2017-02-15 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content
CN104809125A (en) * 2014-01-24 2015-07-29 腾讯科技(深圳)有限公司 Method and device for identifying webpage categories
CN105184574B (en) * 2015-06-30 2018-09-07 电子科技大学 A kind of detection method for applying mechanically trade company's classification code fraud
CN106682217A (en) * 2016-12-31 2017-05-17 成都数联铭品科技有限公司 Method for enterprise second-grade industry classification based on automatic screening and learning of information

Also Published As

Publication number Publication date
CN108959289A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN107729319B (en) Method and apparatus for outputting information
US20190163742A1 (en) Method and apparatus for generating information
CN109145280A (en) The method and apparatus of information push
CN107679217B (en) Associated content extraction method and device based on data mining
CN107105031A (en) Information-pushing method and device
CN108269122B (en) Advertisement similarity processing method and device
CN109388548B (en) Method and apparatus for generating information
CN107908616B (en) Method and device for predicting trend words
CN110827112B (en) Deep learning commodity recommendation method and device, computer equipment and storage medium
CN107944032B (en) Method and apparatus for generating information
CN113688310B (en) Content recommendation method, device, equipment and storage medium
CN108959289B (en) Website category acquisition method and device
CN107977678A (en) Method and apparatus for output information
CN109190123A (en) Method and apparatus for output information
CN108512674B (en) Method, device and equipment for outputting information
CN107346344A (en) The method and apparatus of text matches
CN107908662B (en) Method and device for realizing search system
CN111882224A (en) Method and device for classifying consumption scenes
CN107357847B (en) Data processing method and device
CN107679030B (en) Method and device for extracting synonyms based on user operation behavior data
CN110069691A (en) For handling the method and apparatus for clicking behavioral data
CN115563942A (en) Contract generation method and device, electronic equipment and computer readable medium
CN110827101A (en) Shop recommendation method and device
CN107483595A (en) Information-pushing method and device
CN110069753A (en) A kind of method and apparatus generating similarity information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant