WO2021258968A1 - 小程序分类方法、装置、设备及计算机可读存储介质 - Google Patents

Applet classification method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2021258968A1
Authority
WO
WIPO (PCT)
Prior art keywords
applet
classifier
classifier models
information
classified
Prior art date
Application number
PCT/CN2021/096021
Other languages
English (en)
French (fr)
Inventor
高璇璇
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Publication of WO2021258968A1
Priority to US17/732,382 (published as US20220253307A1)

Classifications

    • G06F 8/70: Software maintenance or management
    • G06F 9/445: Program loading or initiating
    • G06F 18/24: Classification techniques
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 9/44568: Immediately runnable code
    • G06N 5/022: Knowledge engineering; Knowledge acquisition
    • G06N 20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N 20/20: Ensemble learning
    • G06N 5/01: Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • This application relates to the technical field of applet classification, and in particular to an applet classification method, apparatus, device, and computer-readable storage medium.
  • A mini program (applet) is an application form between a traditional H5 web page and a traditional native Android/iOS application.
  • An applet can be used without downloading and installing; compared with a dedicated client, it saves the installation process and puts applications "at your fingertips". It therefore has a very wide range of users and developers.
  • Applets can be divided into non-service applets and service applets.
  • A non-service applet is an applet that only displays basic information, such as a company introduction or a resume, and does not provide any other actual service.
  • A service applet is an applet that can provide actual services, such as reservation, ordering, or check-in services. In order to show the user applets that provide actual services when the user searches for applets, an applet can be classified when it is put on the shelf, so as to identify non-service applets.
  • The embodiments of the present application provide an applet classification method, apparatus, device, and computer-readable storage medium.
  • The applets are classified based on their dynamic features, which can improve the accuracy of the classification results.
  • An embodiment of the present application provides an applet classification method.
  • The method is applied to an applet classification device and includes the steps described below.
  • An embodiment of the present application provides an applet classification apparatus, including:
  • a first obtaining module, configured to obtain the applet code of the applet to be classified;
  • a running module, configured to run the applet code and obtain the dynamic features of the applet to be classified during the running process;
  • a first determining module, configured to input the dynamic features into a trained classifier model to obtain classification information of the applet to be classified;
  • a storage module, configured to store the classification information of the applet to be classified.
  • An embodiment of the present application provides an applet classification device, including a memory and a processor:
  • the memory is configured to store executable instructions; the processor is configured to implement the above applet classification method when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium that stores executable instructions for causing a processor to implement the above applet classification method when executed.
  • After the applet code of the applet to be classified is obtained, the applet code is run to obtain the dynamic features of the applet to be classified during the running process, and the dynamic features are then input into a trained classifier model to determine the classification information of the applet to be classified.
  • The classification information is then stored. Since the dynamic features are extracted while the applet is running, they reflect the actual behavior of the applet during use; classifying the applet by its dynamic features can therefore improve the accuracy of the classification results.
  • FIG. 1 is a schematic diagram of a network architecture of an applet classification system provided by an embodiment of this application;
  • FIG. 2 is a schematic structural diagram of a server 300 provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of an implementation process of an applet classification method provided by an embodiment of this application;
  • FIG. 4A is a schematic diagram of an implementation process for obtaining a trained classifier model provided by an embodiment of this application;
  • FIG. 4B is a schematic diagram of another implementation process for obtaining a trained classifier model provided by an embodiment of this application;
  • FIG. 5 is a schematic diagram of another implementation process of the applet classification method provided by an embodiment of this application.
  • Mini programs, which can also be called web applications, are software downloaded by a client (such as a browser, or any client with an embedded browser core) via a network (such as the Internet) and interpreted and executed in the client's browser environment. A mini program is an application form between a traditional H5 web page and a traditional native Android/iOS application; for example, a web application that can be downloaded and run inside a social-network client to provide services such as ticket purchase and bus codes.
  • Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP means predicted 1 and actually 1 (correct prediction); FP means predicted 1 but actually 0 (wrong); FN means predicted 0 but actually 1 (wrong); TN means predicted 0 and actually 0 (correct).
  • Although accuracy measures overall correctness, it is not a good indicator when the samples are unbalanced.
  • Precision is defined with respect to the prediction results: it is the probability that a sample predicted to be positive is actually positive, i.e. Precision = TP / (TP + FP).
  • Recall is defined with respect to the original samples: it is the probability that an actually positive sample is predicted to be positive.
  • Recall = TP / (TP + FN).
  • The F1 score (F1-Score) considers precision and recall at the same time, balancing the two so that both are as high as possible: F1 = 2 * Precision * Recall / (Precision + Recall).
  • The Receiver Operating Characteristic (ROC) curve is used to evaluate a binary classifier. Compared with indicators such as accuracy, recall, and F-score, the ROC curve has a useful property: when the distribution of positive and negative samples in the test set changes, the ROC curve remains unchanged.
  • The area under the ROC curve (AUC, Area Under Curve) is used to judge the quality of a model. The area under the diagonal is exactly 0.5; the diagonal corresponds to predicting at random, where the coverage of both positive and negative samples is 50%. The steeper the ROC curve, the better, so the ideal AUC value is 1. In practice, the AUC value generally lies between 0.5 and 1.
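As a quick check of the definitions above, the four metrics can be computed directly from confusion-matrix counts; the counts below are invented purely for illustration.

```python
# Metric definitions from the confusion matrix, as described above.
# The counts are hypothetical example values, not data from this application.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With these counts the sample is fairly balanced, so accuracy is informative; with heavily unbalanced counts, precision and recall diverge from accuracy, which is exactly the weakness noted above.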
  • In the related art, the method adopted is based on static statistical features and rules: the number of keys in the static code of the applet is counted, and an applet whose key count is less than a specified value is treated as a non-service applet.
  • The applet classification device provided by the embodiments of this application can be implemented as a notebook computer, a tablet computer, a desktop computer, or a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, or an intelligent robot), that is, any terminal with a screen display function; it can also be implemented as a server.
  • In the following, an exemplary application in which the applet classification device is implemented as a server will be explained.
  • FIG. 1 is a schematic diagram of a network architecture of an applet classification system provided by an embodiment of this application.
  • The applet classification system includes a user terminal 100, a developer terminal 200, and a server 300.
  • The developer of an applet deploys the applet development framework on the developer terminal 200 (for example, a user terminal such as a computer) to complete the code development of the applet.
  • The applet can be used to implement services provided by various service providers, such as ride-code services, express-delivery services, and online shopping.
  • The development framework provides an applet building tool that encapsulates the code of the applet project into one or more JavaScript files that can run in the browser environment of a client, and uploads them to the server 300 to request review; the applet is put on the shelf after the server 300 passes the review.
  • The server 300 may be a server that carries the business logic of a business party, for example, a back-end server that carries the ride service of a ride-service provider.
  • The applet is stored in the server 300 corresponding to the first client.
  • The server 300 may also be a dedicated storage server, for example, the node in a content delivery network (CDN, Content Delivery Network) that has the shortest link to the user's terminal.
  • The user terminal 100 may send an access request for the applet store to the server 300.
  • The server 300 determines the types of applets commonly used by the user terminal 100.
  • The types may be, for example, games, shopping, or travel; based on the determined commonly used applet types, the server determines the service applets that match them and returns these to the user terminal 100 in the access response for easy use by the user.
  • FIG. 2 is a schematic structural diagram of a server 300 provided in an embodiment of the application.
  • the server 300 shown in FIG. 2 includes: at least one processor 310, a memory 340, and at least one network interface 320.
  • the various components in the server 300 are coupled together through the bus system 330.
  • the bus system 330 is used to implement connection and communication between these components.
  • In addition to the data bus, the bus system 330 also includes a power bus, a control bus, and a status signal bus.
  • However, for clarity, the various buses are all marked as the bus system 330 in FIG. 2.
  • The processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
  • the memory 340 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and so on.
  • the memory 340 optionally includes one or more storage devices that are physically remote from the processor 310.
  • The memory 340 includes volatile memory or non-volatile memory, and may also include both.
  • the non-volatile memory may be a read only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 340 described in the embodiment of the present application is intended to include any suitable type of memory.
  • The memory 340 can store data to support various operations; examples of such data include programs, modules, and data structures, or a subset or superset thereof, as illustrated below.
  • Operating system 341, including system programs used to process various basic system services and perform hardware-related tasks, such as a framework layer, a core library layer, and a driver layer;
  • a network communication module 342, configured to reach other computing devices via one or more (wired or wireless) network interfaces 320; exemplary network interfaces 320 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB, Universal Serial Bus), and so on;
  • an input processing module 343, configured to detect one or more user inputs or interactions from one of the one or more input devices 332 and to translate the detected inputs or interactions.
  • The apparatus provided in the embodiments of the present application can be implemented in software.
  • FIG. 2 shows an applet classification apparatus 344 stored in the memory 340.
  • The applet classification apparatus 344 may be an applet classification apparatus in the server 300.
  • The apparatus, which can be software in the form of programs and plug-ins, includes the following software modules: a first acquisition module 3441, a running module 3442, a first determination module 3443, and a storage module 3444. These modules are logical, and can therefore be arbitrarily combined or further split according to the functions they implement. The function of each module is explained below.
  • The apparatus provided in the embodiments of the present application may also be implemented in hardware.
  • As an example, the apparatus may be a processor in the form of a hardware decoding processor, programmed to execute the applet classification method provided in the embodiments of the present application.
  • For example, the processor in the form of a hardware decoding processor may adopt one or more Application Specific Integrated Circuits (ASIC), DSPs, Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), or other electronic components.
  • FIG. 3 is a schematic diagram of an implementation process of the applet classification method provided by an embodiment of this application, which is described in conjunction with the steps shown in FIG. 3.
  • Step S101: Obtain the applet code of the applet to be classified.
  • The applet to be classified may be submitted to the server by the applet developer after the applet code is developed.
  • The server may obtain the applets to be classified at fixed intervals; for example, every 12 hours it may obtain the applets to be classified that were received between the moment 12 hours ago and the current moment.
  • Alternatively, whenever the server receives an applet submitted by a developer terminal, that applet may be determined as the applet to be classified and the subsequent steps executed; that is, the applet to be classified can be obtained in real time.
  • In step S102, the applet code can be rendered and run dynamically.
  • Each preset event is triggered during the running of the applet, and the JS API called when each event is triggered is recorded.
  • The number of bound controls is also recorded, and screenshots of the applet page before any event is triggered and after all events have been triggered are taken and base64-encoded.
  • The data obtained above are merged and saved as a JSON file, and the dynamic features of the applet to be classified are then extracted from the JSON file.
  • The dynamic features of the applet may include statistical features, API features, and image features. The statistical features may include the number of successfully triggered events, the number of APIs, and the number of interactive controls; the API features include the total number of times each API is called; the image features may include the difference information between the screenshot of the applet page before any event is triggered and the screenshot after all events have been triggered.
  • An interactive control is a control that can present a page jump or an interactive page in response to a touch or click operation during the running of the applet.
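As an illustration only, the merged JSON record and its flattening into a numeric feature vector might look like the following sketch. The embodiment specifies the three feature groups but not a concrete schema, so every field name and value here is an assumption.

```python
import json

# Hypothetical merged record saved after one dynamic run of an applet.
# Field names are assumptions; only the three feature groups come from the text.
run_record = {
    "statistical": {
        "triggered_events": 7,       # number of successfully triggered events
        "api_count": 12,             # number of distinct JS APIs observed
        "interactive_controls": 5,   # controls that produced a jump or interaction
    },
    "api": {                         # per-API total call counts
        "wx.request": 9,
        "wx.navigateTo": 3,
    },
    "image": {
        "hamming_distance": 17,      # difference between before/after screenshots
    },
}

def to_feature_vector(record):
    """Flatten the JSON record into a numeric vector for the classifier model."""
    s, img = record["statistical"], record["image"]
    api_calls = record["api"]
    return [
        s["triggered_events"],
        s["api_count"],
        s["interactive_controls"],
        sum(api_calls.values()),     # total API calls across all APIs
        img["hamming_distance"],
    ]

# Round-trip through JSON, as the embodiment saves the merged data as a file.
print(to_feature_vector(json.loads(json.dumps(run_record))))
```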
  • Step S103: Input the dynamic features into the trained classifier model to obtain the classification information of the applet to be classified.
  • The trained classifier model may include, but is not limited to, logistic regression, support vector machine, decision tree, naive Bayes, K-nearest neighbor, bagging K-nearest neighbor, bagging decision tree, random forest, AdaBoost, and gradient boosting decision tree.
  • The trained classifier model may include only one or more trained base classifiers.
  • In that case, the dynamic features are input into the one or more trained base classifiers.
  • One or more initial prediction values are obtained correspondingly, and the target prediction value is then determined from the one or more initial prediction values.
  • The target prediction value is the probability that the applet to be classified is a non-service applet; finally, the classification information of the applet to be classified is determined from the target prediction value.
  • The classification information indicates either a service applet or a non-service applet.
  • The trained classifier model may also include not only multiple trained base classifiers but also a trained ensemble classifier.
  • In that case, the dynamic features are input into the multiple trained base classifiers to obtain multiple initial prediction values; the multiple initial prediction values are then input into the ensemble classifier for data integration to obtain the target prediction value, and finally the classification information of the applet to be classified is determined from the target prediction value.
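The two-layer variant described above, in which base classifiers feed a trained ensemble classifier, corresponds to what is commonly called stacking. A minimal sketch using scikit-learn follows; the synthetic stand-in data and the particular choice of base models are assumptions for illustration, not choices made by the embodiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for dynamic-feature vectors and non-service (1) / service (0) labels.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Base classifiers (the trained base classifier models) feeding a
# second-level ensemble classifier, as in the two-layer variant above.
base = [
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression())
stack.fit(X, y)

# Target prediction value: probability that the applet is non-service (class 1).
target_pred = stack.predict_proba(X[:1])[0, 1]
print(0.0 <= target_pred <= 1.0)
```

Internally, `StackingClassifier` trains the final estimator on cross-validated predictions of the base models, which matches the idea of the ensemble classifier integrating the initial prediction values.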
  • Step S104: Store the classification information of the applet to be classified.
  • When step S104 is implemented, the correspondence between the identifier of the applet to be classified and the classification information may be stored, or the classification information may be stored as attribute information of the applet to be classified.
  • In the embodiments of the present application, the applet code is run to obtain the dynamic features of the applet to be classified during the running process, and the classification information of the applet to be classified is then determined based on the dynamic features and the trained classifier model, and stored. Since the dynamic features are extracted while the applet is running, they reflect the events the applet can actually trigger and the APIs called during use; classifying the applet by its dynamic features can therefore improve the recall rate of the classification results.
  • In some embodiments, step S102 shown in FIG. 3 may be implemented through the following steps.
  • Step S1021: Run the applet code to obtain a first applet interface image.
  • When step S1021 is implemented, the applet code can be rendered and run.
  • The applet interface can be rendered on the display interface of the server to obtain the current first applet interface image.
  • The first applet interface image is the interface image before any event has been triggered.
  • Step S1022: Trigger each preset event in turn, obtain the successfully triggered target events, and obtain the application program interface information called when each target event is triggered and the control information corresponding to each target event.
  • A preset event may be a single-click event, a double-click event, or a long-press event of a control in the applet to be classified.
  • When step S1022 is implemented, each control in the applet interface image is first identified based on the applet interface image; a preset event is then triggered on each control, the successfully triggered target events are obtained, and the application program interface information called when each target event is triggered is obtained, where the application program interface information includes at least the application program interface identifier and the number of times the application program interface is called.
  • The control information corresponding to the target event is also obtained: for the control through which a target event is successfully triggered, the attribute information of that control is the control information corresponding to the target event.
  • The control information may include at least a control identifier.
  • After an event is successfully triggered, the applet interface may change. Therefore, after each successful trigger, the applet interface image can be obtained again and the controls in the current applet interface image identified; a preset event is then triggered on each control in the current applet interface image, and the successfully triggered target events, the application program interface information called when each target event is triggered, and the control information corresponding to each target event are obtained.
  • Step S1023 after triggering each of the preset events, acquire the second applet interface image.
  • Step S1024 based on the number of target events, the application program interface information, the control information, the first applet interface image and the second applet interface image, determine the dynamic characteristics of the applet to be classified in the running process.
  • When step S1024 is implemented, the total number of calls of each application program interface can be determined based on the application program interface information, the number of interactive controls can be determined based on the control information, and image difference information can be determined from the first applet interface image and the second applet interface image.
  • The image difference information may be obtained by calculating the Hamming distance between the first applet interface image and the second applet interface image.
  • Calculating the Hamming distance between the first applet interface image and the second applet interface image can be realized as follows: first, transform the sizes of the first applet interface image and the second applet interface image so that both are processed into images of the same, smaller size, for example reduced to 8*8 or 16*16.
  • Then perform color simplification; for example, the reduced images can be converted to 64-level grayscale, so that all pixels take only 64 possible values. Next, calculate the average gray value of each simplified image, and compare the gray value of each pixel in the simplified first applet image and the simplified second applet image with the corresponding average to obtain a binary fingerprint for each image; the Hamming distance is the number of positions at which the two fingerprints differ.
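This average-hash style comparison can be sketched in a few lines of plain Python. The resizing and gray-level reduction are assumed to have been done already, and the toy 2x2 "images" below are invented for illustration.

```python
def average_hash(pixels):
    """Average hash of a grayscale image given as a flat list of gray values.

    Resizing to a small fixed size and gray-level reduction, as described
    above, are assumed to have been applied before this function is called.
    """
    avg = sum(pixels) / len(pixels)
    # Bit is 1 where the pixel is at least as bright as the average, else 0.
    return [1 if p >= avg else 0 for p in pixels]

def hamming_distance(hash_a, hash_b):
    """Number of bit positions at which the two fingerprints differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Toy 2x2 "screenshots": before triggering events vs. after all events.
before = [10, 200, 10, 200]
after = [10, 10, 10, 200]
print(hamming_distance(average_hash(before), average_hash(after)))
```

A small Hamming distance means the interface barely changed after all events were triggered, which is one signal that the applet provides no interactive service.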
  • In addition, the total number of application program interfaces called by the applet to be classified can also be determined.
  • The number of target events, the total number of calls of each application program interface, the total number of application program interfaces, the number of interactive controls, and the image difference information are determined as the dynamic features of the applet to be classified.
  • In the embodiments of the present application, the total number of application program interfaces of the applet, the number of interactive controls, the total number of calls of each application program interface, and the number of target events that can be triggered are obtained by actually running the applet.
  • Extracting the features dynamically ensures that the obtained features truly reflect the actual behavior of the applet; therefore, the recall rate of the classification results can be improved when the classification information of the applet is determined from the dynamic features.
  • In some embodiments, the trained classifier model includes at least K trained first classifier models.
  • In this case, step S103 can be implemented through the following steps.
  • Step S1031: Input the dynamic features into the K first classifier models respectively, and correspondingly obtain K initial prediction values.
  • A first classifier model may include, but is not limited to, logistic regression, support vector machine, decision tree, naive Bayes, K-nearest neighbor, bagging K-nearest neighbor, bagging decision tree, random forest, AdaBoost, and gradient boosting decision tree.
  • The dynamic features of the applet to be classified are input into the K trained first classifier models respectively; the K first classifier models perform prediction processing on the applet to be classified, and K initial prediction values are obtained correspondingly.
  • An initial prediction value is the initial probability that the applet to be classified is a non-service applet, and is a real number between 0 and 1.
  • Step S1032: Determine the target prediction value based on the K initial prediction values.
  • When step S1032 is implemented and K is 1, the single initial prediction value is directly determined as the target prediction value.
  • When K is an integer greater than 1, the K initial prediction values can be averaged to obtain the target prediction value.
  • The averaging can be an arithmetic average or a weighted average.
  • A first classifier model may be a base classifier model.
  • In some embodiments, the trained classifier model further includes a trained second classifier model.
  • In this case, the K initial prediction values may be input into the second classifier model, which integrates the K initial prediction values to obtain the target prediction value.
  • Step S1033: Determine the classification information of the applet to be classified based on the target prediction value and a preset classification threshold.
  • The classification information of the applet to be classified can be determined by comparing the target prediction value with the classification threshold.
  • When the target prediction value is greater than the classification threshold, the classification information of the applet to be classified is determined to be the first type of applet; when the target prediction value is less than or equal to the classification threshold, the classification information is determined to be the second type of applet. The first type of applet is a non-service applet, and the second type of applet is a service applet.
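Steps S1032 and S1033 can be sketched as follows. The K=3 initial prediction values and the 0.5 threshold are illustrative assumptions; the embodiment leaves the threshold value and any averaging weights unspecified.

```python
def target_prediction(initial_preds, weights=None):
    """Combine K initial prediction values into the target prediction value
    by arithmetic or weighted average (K may also be 1)."""
    if weights is None:
        return sum(initial_preds) / len(initial_preds)
    return sum(p * w for p, w in zip(initial_preds, weights)) / sum(weights)

def classify(target_pred, threshold=0.5):
    """First type (non-service) above the threshold, second type otherwise.
    The 0.5 default is an assumed example value, not specified by the text."""
    return "non-service applet" if target_pred > threshold else "service applet"

preds = [0.9, 0.7, 0.8]            # hypothetical outputs of K = 3 base classifiers
target = target_prediction(preds)  # arithmetic average of the K values
print(target, classify(target))
```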
  • In some embodiments, a trained classifier model needs to be obtained before step S103.
  • The trained classifier model can be obtained through the following steps, shown in FIG. 4A.
  • Step S001: Obtain a first training data set and M preset first candidate classifier models.
  • The first training data set includes the dynamic features of multiple training applets and the label information of those training applets. The label information characterizes whether a training applet is a non-service applet or a service applet; for example, when the training applet is a non-service applet the label information is 1, and when the training applet is a service applet the label information is 0.
  • A first candidate classifier model includes, but is not limited to, logistic regression, support vector machine, decision tree, naive Bayes, K-nearest neighbor, bagging K-nearest neighbor, bagging decision tree, random forest, AdaBoost, and gradient boosting decision tree.
  • Step S002 Determine performance parameters corresponding to the M first candidate classifier models based on the first training data set.
  • an S-fold cross-validation method (for example, a ten-fold cross-validation method) may be used to determine the performance parameters of the M first candidate classifier models under the first training data set.
  • performance parameters include but are not limited to accuracy, precision, recall, F1-score, ROC, AUC.
  • accuracy and recall rate can be determined for each first candidate classifier model.
  • Step S003 Determine K first classifier models based on the performance parameters corresponding to the M first candidate classifier models.
  • if, in step S002, one performance parameter is determined for each first candidate classifier model, then the K best-performing first classifier models are selected based on that parameter. For example, if the accuracy of each first candidate classifier model is determined, then when step S003 is implemented, the accuracies of the M first candidate classifier models may be sorted, and the K first classifier models with the highest accuracy selected from the M first candidate classifier models.
  • if, in step S002, at least two performance parameters are determined for each first candidate classifier model, then when step S003 is implemented, the single performance parameter of most concern may be selected from the at least two, and the K first classifier models determined from the M candidates based on it; alternatively, several performance parameters of concern may be selected and combined by arithmetic average, weighted average, or summation, and the K first classifier models determined from the M candidates based on the combined score.
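The two selection variants just described (a single metric, or a weighted combination of metrics) can be sketched together. The metric names and weights here are illustrative assumptions.

```python
def select_top_k(perf, k, weights=None):
    """Pick the K candidate classifiers with the best scores.

    perf maps model name -> {metric name: value}.  If weights is given,
    the metrics of concern are combined by weighted average; otherwise a
    single metric named "accuracy" is used (names are illustrative).
    """
    def score(metrics):
        if weights:
            total = sum(weights.values())
            return sum(metrics[m] * w for m, w in weights.items()) / total
        return metrics["accuracy"]

    return sorted(perf, key=lambda name: score(perf[name]), reverse=True)[:k]


perf = {
    "logistic_regression": {"accuracy": 0.91, "recall": 0.88},
    "svm":                 {"accuracy": 0.89, "recall": 0.95},
    "decision_tree":       {"accuracy": 0.85, "recall": 0.80},
}
top_by_accuracy = select_top_k(perf, 2)
top_by_average = select_top_k(perf, 2, weights={"accuracy": 1, "recall": 1})
```

With equal weights the SVM's high recall pulls it ahead, showing how the choice of combination rule changes which K models survive.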
  • Step S004 Use the first training data set to train the K first classifier models to obtain K trained first classifier models.
  • in step S004, the dynamic features of the multiple training applets in the first training data set are input into the K first classifier models respectively to obtain the corresponding training prediction values.
  • then, according to the training prediction value and the label information of each training applet, the difference between the training prediction value and the actual label information is determined.
  • based on this difference, the parameters of the K first classifier models are adjusted until a preset training completion condition is reached, thereby obtaining the K trained first classifier models.
  • the training completion condition may be that the preset number of training times is reached, or it may be that the difference between the training prediction value and the actual label information is less than the preset threshold.
  • in this way, the K best-performing first classifier models can be selected based on the performance parameters of the M first candidate classifier models.
  • the K first classifier models are then trained on the first training data set to obtain K trained first classifier models, which are used to classify the applet to be classified and determine its classification information.
  • the trained second classifier model can also be obtained through the following steps:
  • Step S005 using the first training data set and the K first classifier models to construct a second training data set.
  • the second training data set includes: the prediction information of the K first classifier models for the training applets and the label information of the training applets, where the prediction information includes at least the predicted probability value that the training applet is a no-service applet.
  • Step S006 Obtain N preset second candidate classifier models, and determine performance parameters corresponding to the N second candidate classifier models based on the second training data set.
  • N is an integer greater than 1.
  • the types of the N second candidate classifier models can be determined first, and then the optimal hyperparameters of each second candidate classifier can be searched by the grid search method, thereby obtaining the N second candidate classifier models.
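The grid search step can be sketched as an exhaustive loop over hyperparameter combinations. The toy scoring function and the parameter names "C" and "gamma" are stand-ins; in practice the score would be the S-fold cross-validation performance of the classifier built with those hyperparameters.

```python
from itertools import product


def grid_search(evaluate, param_grid):
    """Try every hyperparameter combination and keep the best one.

    evaluate maps a params dict to a score (higher is better), e.g. the
    cross-validated performance of a classifier with those settings.
    """
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# toy objective standing in for cross-validated performance
def evaluate(p):
    return 1.0 - (p["C"] - 1.0) ** 2 - (p["gamma"] - 0.1) ** 2


best, score = grid_search(evaluate,
                          {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]})
```

The grid here has 3 × 3 = 9 combinations; real grids grow multiplicatively with each added hyperparameter, which is why the candidate types are usually fixed first.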
  • an S-fold cross-validation method (for example, a ten-fold cross-validation method) may be used to determine the performance parameters of the N second candidate classifier models under the second training data set.
  • performance parameters include but are not limited to accuracy, precision, recall, F1-score, ROC, AUC.
  • the performance parameter type determined in step S006 is the same as that determined in step S002. For example, if the accuracy and recall rate of the first classifier models are determined in step S002, then the accuracy and recall rate of the second classifier models are also determined in step S006.
  • Step S007 Based on the performance parameters corresponding to the N second candidate classifier models and the performance parameters corresponding to the K first classifier models, a second classifier model is determined from the N second candidate classifier models.
  • step S007 has at least the following two implementation modes:
  • the first implementation is to compare the performance parameters of the N second candidate classifier models with those of the K first classifier models in turn. Once a second candidate classifier model is determined to perform better than the K first classifier models, and the performance difference between that second candidate classifier model and the K first classifier models is greater than a preset threshold, that second candidate classifier model is determined as the second classifier model.
  • the second implementation is: among the N second candidate classifier models whose performance parameters are all better than those of the K first classifier models, the second candidate classifier model with the best performance parameters is determined as the second classifier model.
  • alternatively, the performance parameters of the N second candidate classifier models may be compared with those of the K first classifier models in turn, and once a second candidate classifier model is determined to perform better than the K first classifier models, that second candidate classifier model is determined as the second classifier model.
  • Step S008 Use the second training data set to train the second classifier model to obtain the trained second classifier model.
  • when implemented, the prediction information of the K first classifier models for the training applets in the second training data set may be input into the second classifier model to obtain the training prediction value of the second classifier model for each training applet. Then, according to the training prediction value and the label information of each training applet, the difference between the training prediction value and the actual label information is determined, and the parameters of the second classifier model are adjusted until a preset training completion condition is reached, thereby obtaining the trained second classifier model.
  • the training completion condition may be that the preset number of training times is reached, or it may be that the difference between the training prediction value and the actual label information is less than the preset threshold.
  • Step S051 Divide the first training data set into P first training data subsets.
  • P is an integer greater than 1.
  • the first training data set is divided into 10 first training data subsets.
  • Step S052 Determine the i-th first training data subset as the i-th test data set.
  • i = 1, 2, ..., P.
  • Step S053 Use other first training data subsets to train the K first classifier models to obtain K trained first classifier models.
  • the other first training data subsets are (P-1) first training data subsets other than the i-th first training data subset.
  • Step S054 Perform prediction processing on the i-th test data set by using the K trained first classifier models to obtain prediction information of the K first classifier models for training the applet in the i-th test data set.
  • step S052 to step S054 are executed P times in a loop, so as to obtain prediction information of the K first classifier models for the training applet in the first to Pth test data sets.
  • Step S055 Determine the prediction information of the K first classifier models on the training applet in the first to Pth test data sets and the label information of the training applet as the second training data set.
  • in the second training data set, each training applet corresponds to K training prediction values, and the K training prediction values represent the probability values, predicted by the K first classifier models, that the training applet is a no-service applet.
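Steps S051 to S055 amount to collecting out-of-fold predictions: each training applet's row in the new set holds the probabilities predicted by base models that never saw that applet during training. The sketch below assumes each base learner is a (fit, predict) pair and uses a trivial "mean label" learner; both are illustrative stand-ins, as is the interleaved fold split.

```python
def build_stacking_set(X, y, base_learners, p):
    """Build the second training set from out-of-fold base predictions.

    Returns (X1, y) where row j of X1 holds the K predicted no-service
    probabilities for training applet j, each produced by a model fitted
    on the other P-1 subsets.
    """
    n, k = len(X), len(base_learners)
    folds = [list(range(n))[i::p] for i in range(p)]
    X1 = [[0.0] * k for _ in range(n)]
    for i in range(p):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        Xtr, ytr = [X[j] for j in train], [y[j] for j in train]
        for c, (fit, predict) in enumerate(base_learners):
            model = fit(Xtr, ytr)
            for j in folds[i]:          # predict only on the held-out fold
                X1[j][c] = predict(model, X[j])
    return X1, y


# toy base learner: predicts the mean label of its training data
def mean_fit(X, y):
    return sum(y) / len(y)


def mean_predict(model, x):
    return model


mean_learner = (mean_fit, mean_predict)
X = list(range(10))
y = [0, 1] * 5                        # even indices labelled 0, odd 1
X1, y1 = build_stacking_set(X, y, [mean_learner, mean_learner], p=2)
```

With p=2 and this split, each even-indexed sample is predicted by a model trained only on odd-indexed (all-1) samples, and vice versa, which makes the out-of-fold property easy to see.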
  • step S007 shown in FIG. 4B can be implemented in the following two ways:
  • the first implementation can be achieved through the following steps:
  • Step S071A The performance parameters corresponding to the N second candidate classifier models and those corresponding to the K first classifier models are compared in turn.
  • Step S072A When it is determined that the performance parameters corresponding to the j-th second candidate classifier model are better than those corresponding to the K first classifier models, the performance difference value between the j-th second candidate classifier model and the K first classifier models is determined.
  • Step S073A When the performance difference value is greater than the preset difference threshold, the jth second candidate classifier model is determined as the second classifier model.
  • j is an integer between 1 and N.
  • in the first implementation manner, the second classifier model can be determined without comparing all N second candidate classifier models, but it cannot be guaranteed that the determined second classifier model has the best performance among the N second candidate classifier models.
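Steps S071A to S073A can be sketched as a greedy scan. One detail the source leaves open is what "performance difference between the candidate and the K first classifier models" means numerically; the sketch reads it as the margin over the best base model, which is an assumption.

```python
def pick_second_classifier_greedy(candidates, base_perfs, diff_threshold):
    """Accept the first candidate (in scan order) that beats every base
    classifier by more than diff_threshold.

    candidates is an ordered list of (name, performance) pairs;
    base_perfs holds the K first classifier models' performances.
    """
    best_base = max(base_perfs)
    for name, perf in candidates:
        if perf > best_base and perf - best_base > diff_threshold:
            return name
    return None  # no candidate clears the margin


base_perfs = [0.84, 0.86, 0.88]                      # K first classifier models
candidates = [("gbdt", 0.885), ("random_forest", 0.93)]
chosen = pick_second_classifier_greedy(candidates, base_perfs,
                                       diff_threshold=0.02)
```

Here "gbdt" beats the best base model but only by 0.005, so the scan continues and "random_forest" is accepted — illustrating why this variant is cheap but not guaranteed to find the overall best candidate.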
  • the second implementation can be achieved through the following steps:
  • Step S071B Determine Q second target classifier models based on the performance parameters corresponding to the N second candidate classifier models and the performance parameters corresponding to the K first classifier models.
  • the performance parameters corresponding to the second target classifier model are better than the performance parameters corresponding to the K first classifier models.
  • Step S072B respectively determine Q performance difference values between the Q second target classifier models and the K first classifier models.
  • Step S073B based on the Q performance difference values, determine a second classifier model from the Q second target classifier models.
  • when step S073B is implemented, the second target classifier model with the best performance may be determined from the Q second target classifier models as the second classifier model, based on the Q performance difference values.
  • in the second implementation, all Q second target classifier models whose performance is better than that of the K first classifier models are first determined from the N second candidate classifier models, and then the one with the best performance is determined from the Q second target classifier models as the second classifier model.
  • the second implementation is more computationally intensive, but it can determine the second classifier model with the best performance. In practical applications, the first or second implementation can be chosen according to actual needs.
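Steps S071B to S073B can be sketched as a filter-then-pick procedure. As in the previous sketch, the performance difference is taken against the best base model, which is one possible reading of the source.

```python
def pick_second_classifier_best(candidates, base_perfs):
    """Keep every candidate that outperforms all K base classifiers
    (the Q second target classifier models), then return the one with
    the largest performance margin."""
    best_base = max(base_perfs)
    targets = [(name, perf - best_base)
               for name, perf in candidates if perf > best_base]
    if not targets:
        return None           # no candidate beats the base models
    return max(targets, key=lambda t: t[1])[0]


base_perfs = [0.84, 0.86, 0.88]
candidates = [("gbdt", 0.885), ("random_forest", 0.93), ("adaboost", 0.91)]
chosen = pick_second_classifier_best(candidates, base_perfs)
```

Unlike the greedy variant, this one always examines all N candidates (here Q = 3 of them qualify), trading extra comparisons for a guarantee of picking the best-performing second classifier model.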
  • the applet is sent to the server for review before it is put on the shelf. After receiving the applet file and passing the review, the server runs the applet, triggers various events in turn, and acquires the dynamic characteristics of the applet during its operation; the applet is then classified according to the dynamic characteristics, and its classification information is stored.
  • the classification information of the applet can be no-service applet or service applet.
  • the server may publish the applet to the applet store after the applet is reviewed, and then classify the applet based on the dynamic characteristics of the applet after publishing, and store the classification information of the applet.
  • FIG. 5 is a schematic diagram of another implementation process of the small program classification method provided by the embodiment of the application. The following describes each step in conjunction with FIG. 5.
  • step S501 the developer terminal obtains the applet code edited by the applet developer.
  • the developer terminal is provided with an applet development environment or an applet development platform, and an applet developer can edit the applet code through the applet development environment or the applet development platform to realize applet development.
  • step S502 the developer terminal sends the developed small program code to the server based on the received upload operation.
  • when step S502 is implemented, after the applet developer completes the applet code development, the code is packaged into one or more JavaScript files that can be run in the browser environment of the client and uploaded to the server.
  • step S503 the server reviews the mini program, and when the review passes, the mini program is put on the shelf.
  • the server's review of the mini program can be to check whether the content of the mini program complies with the rules, such as whether it involves rumors, fraud, gambling or other illegal content; it can also review whether the mini program code has defects (bugs), whether the functions are complete, and so on. After the applet has passed the review, it is published to the applet store or navigation website, that is, users can search for and use the applet.
  • step S504 the server runs the code of the applet to obtain the dynamic characteristics of the applet during the running process.
  • step S505 the server inputs the dynamic feature into the K first classifier models respectively, and obtains K initial prediction values correspondingly.
  • K is an integer greater than 1.
  • the first classifier model corresponds to the base classifier in other embodiments, and the initial prediction value is the probability value that the applet is a no-service applet, which is a real number between 0 and 1.
  • step S506 the server inputs the K initial prediction values to the second classifier model to perform integration processing on the K initial prediction values to obtain the target prediction value.
  • the second classifier model corresponds to the integrated classifier in other embodiments, and is used to perform integrated processing on the K initial prediction values, so as to obtain the final target prediction value.
  • Step S507 The server determines the classification information of the applet based on the target predicted value and the preset classification threshold.
  • the classification information of the applet can be determined by judging the relationship between the target predicted value and the classification threshold.
  • when the target predicted value is greater than the classification threshold, it is determined that the classification information of the applet is the first type of applet; when the target predicted value is less than or equal to the classification threshold, it is determined that the classification information of the applet is the second type of applet.
  • the first type of applet is a no-service applet, and the second type of applet is a service applet.
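Steps S505 to S507 form a two-stage inference pipeline, which can be sketched as below. The fixed-probability stand-in models, the mean as the integration rule, and the 0.5 threshold are all assumptions made for the sketch; the real second classifier model is itself trained, not a fixed average.

```python
def predict_applet_class(dynamic_features, base_models, ensemble,
                         threshold=0.5):
    """K first classifier models each emit an initial no-service
    probability; the second classifier model integrates them into the
    target predicted value, which is thresholded into a class."""
    initial = [model(dynamic_features) for model in base_models]  # K values
    target = ensemble(initial)                                    # integrated
    return "no-service" if target > threshold else "service"


# stand-in base models and a simple mean as the integration step
base_models = [lambda f: 0.9, lambda f: 0.7, lambda f: 0.8]


def mean_ensemble(preds):
    return sum(preds) / len(preds)


label = predict_applet_class([0.0] * 22, base_models, mean_ensemble)
```

The 22-element feature vector mirrors the 22-dimensional dynamic feature described later in the embodiment; any same-length vector would do for this stand-in.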
  • step S508 the server stores the classification information of the applet.
  • the corresponding relationship between the identification of the applet and the classification information may be stored, or the classification information may be stored as an attribute information of the applet.
  • the above steps S504 to S508 may be executed after the applet is reviewed and before the applet is released, that is, the classification information of the applet is determined before the applet is released, and is published to the applet store together with, or after, the applet file.
  • step S509 the user terminal obtains a search keyword in response to the search operation of the applet.
  • step S510 the user terminal sends a search request to the server.
  • the search request carries search keywords.
  • step S511 the server performs a search based on the search keyword in the search request, and obtains the first search result.
  • the server determines the applet identifier matching the search keyword from the applet identifiers stored in the server based on the search keyword, and determines the applet identifier as the first search result.
  • Step S512 The server obtains classification information corresponding to each applet identifier in the first search result.
  • step S513 the server deletes the small program identifiers whose classification information is the no-service small program from the first search result according to the classification information corresponding to each small program identifier, and obtains the second search result.
  • step S514 the server returns the second search result to the user terminal.
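Steps S512 to S514 reduce to filtering the first search result by stored classification information. The identifier strings and the "no-service" marker below are illustrative; the source only says the correspondence between applet identifiers and classification information is stored.

```python
def filter_search_results(first_results, classification_info):
    """Drop applet identifiers classified as no-service so the second
    search result only contains applets that provide services.

    first_results: ordered applet identifiers matching the keyword.
    classification_info: mapping identifier -> stored classification.
    """
    return [app_id for app_id in first_results
            if classification_info.get(app_id) != "no-service"]


first_results = ["wx001", "wx002", "wx003"]
classification_info = {"wx001": "service",
                       "wx002": "no-service",
                       "wx003": "service"}
second_results = filter_search_results(first_results, classification_info)
```

Identifiers missing from the stored mapping are kept rather than dropped here; whether unclassified applets should be shown is a policy choice the source does not specify.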
  • the mini program code is sent to the server for review by the server, and after the review is passed, the mini program is put on the shelf.
  • the server also runs the applet and triggers various preset events in turn to obtain the dynamic characteristics of the applet during its operation, and then determines the classification information of the applet according to these dynamic characteristics and the trained first classifier models and second classifier model.
  • when the server receives a search request, it determines the first applets that match the search keyword, obtains the classification information of each first applet, deletes the no-service applets according to that classification information, and returns the remaining search results to the user terminal, thereby ensuring that the search results finally obtained by the user are all applets that can provide services.
  • in the process of analyzing the content profile of an applet, the applet needs to be classified to determine whether it is a no-service applet or a service applet.
  • "no service” means that the service provided is to display some basic information, such as company introduction or resume display, but no other actual services.
  • Step S601 Obtain the original dynamic characteristics of the applet.
  • the dynamic feature of the applet is relative to the static feature.
  • the static characteristics of the applet refer to the characteristics that can be mined from the static code of the applet, such as the number of controls, the Document Object Model (DOM), and custom components. However, the elements written in the static code may not necessarily be presented or called, so the page actually presented to the user by the applet does not necessarily correspond to its static code.
  • the dynamic characteristics of the applet refer to the characteristics that can be obtained when the applet code is rendered and dynamically run, such as which events can be triggered, the JS APIs called when each event is triggered, and the controls bound to each event. In effect, this simulates the interaction between the user and the page, so dynamic features reflect the real user interaction experience better than static features.
  • when acquiring the original dynamic characteristics of the applet, the applet code is rendered and dynamically run, each event is triggered in turn, the JS APIs called when each event is triggered are recorded, and the controls bound to each triggered event are counted.
  • Step S602 extract and construct effective dynamic features.
  • the extracted dynamic features can include statistical features, API features, and image features, where the statistical features can include the number of triggerable events, the number of APIs, and the number of interactive controls;
  • the API features can include the total number of times each API is called; for example, the total number of times the API for obtaining system information is called, the total number of times the API for scanning the QR code is called, the total number of times the API for displaying the message prompt box is called, and so on;
  • the image features can include a screenshot of the page before any event is triggered and a screenshot of the page after all events are triggered.
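The feature construction of step S602 can be sketched as flattening run-time statistics into a fixed-order vector. Every key and API name below (getSystemInfo, scanCode, showToast, screenshot_hash_diff, and so on) is a hypothetical stand-in chosen to echo the categories the source describes; the real 22 dimensions are the ones listed in Table 1.

```python
def build_feature_vector(run_log):
    """Flatten the statistics collected while running the applet into a
    fixed-order numeric feature vector (illustrative dimensions only)."""
    statistical = [
        len(run_log["triggered_events"]),     # number of triggerable events
        sum(run_log["api_calls"].values()),   # total number of API calls
        run_log["interactive_controls"],      # number of interactive controls
    ]
    # per-API totals: obtaining system info / scanning a QR code /
    # showing a message prompt box (hypothetical API names)
    api = [run_log["api_calls"].get(name, 0)
           for name in ("getSystemInfo", "scanCode", "showToast")]
    # difference between screenshots taken before and after all events
    image = [run_log["screenshot_hash_diff"]]
    return statistical + api + image


run_log = {
    "triggered_events": ["tap", "input"],
    "api_calls": {"getSystemInfo": 2, "showToast": 1},
    "interactive_controls": 5,
    "screenshot_hash_diff": 13,
}
features = build_feature_vector(run_log)
```

Keeping the order fixed is what lets the same vector feed every base classifier; an API never observed simply contributes a zero.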
  • Step S603 Classify the presence or absence of services based on the dynamic characteristics of the applet.
  • the obtained dynamic feature of the applet may be input into the trained multiple base classification models to obtain each predicted value of each base classification model for the applet, where the predicted value is The applet is the probability value of the no-service applet; then each predicted value is input to the trained integrated classifier model to integrate each predicted value through the integrated classifier model to obtain the final predicted value.
  • the predicted value is also the probability value that the applet is a no-service applet.
  • the final predicted value is compared with the preset classification threshold to obtain the classification information of the applet. For example, if the final predicted value is greater than the classification threshold, the applet is judged as no service. If the final predicted value is less than the classification threshold, it is judged that it is not a serviceless applet.
  • Table 1 shows the dynamic features of the mini program extracted from the json file in this embodiment of the application:
  • Table 1 above shows the feature category, feature name, and variable name of the dynamic feature that needs to be extracted from the json file.
  • when step S602 is implemented, the statistical features and JS API features in Table 1 and the Hash_diff in Table 2 are retained, while Pic_0 and Pic_1 in Table 1 and Hash_0 and Hash_1 in Table 2 are removed, yielding a 22-dimensional dynamic feature.
  • the 22-dimensional dynamic feature is used as input information in step S603 and is input into the classifier model to determine the classification information of the applet.
  • step S603 the trained base classifier model and the ensemble classifier model need to be obtained through the following steps:
  • Step S701 Collect data with label information, and construct a training set (X, y).
  • the label information is used to characterize whether the applet is a serviceless applet or a service applet.
  • the label information can be 0 or 1.
  • when the label information is 1, it means that the applet is a no-service applet; when the label information is 0, it means that the applet is a service applet.
  • the data with label information may be a small program code.
  • when step S701 is implemented, after the small program code with label information is collected, the small program code is run and rendered, the dynamic characteristics of the small program are obtained, and the training set (X, y) is then constructed from the label information and the dynamic features, where X ∈ R^{n×22}, y ∈ {0,1}^{n×1}, n is the number of samples, and 22 is the dimension of the features extracted in step S602.
  • Step S702 construct m classifiers.
  • the m classifiers include, but are not limited to: logistic regression, support vector machine, decision tree, naive Bayes, K-nearest neighbor, bagging K-nearest neighbor, bagging decision tree, random forest, AdaBoost, and gradient boosting decision tree.
  • the grid search method can be used to search for the optimal hyperparameters of each classifier, and under the optimal hyperparameters, ten-fold cross validation is used to evaluate the performance of each classifier under the training set (X, y).
  • Performance indicators include but are not limited to accuracy, precision, recall, F1-score, ROC and AUC.
  • in step S703, the k classifiers with the best performance under the index of most concern are selected from the m classifiers as the base classifiers, and a new classifier is added to stack the base classifiers.
  • for example, the index of most concern may be the recall rate.
  • the types of newly added classifiers include, but are not limited to: logistic regression, support vector machine, decision tree, naive Bayes, K-nearest neighbor, bagging K-nearest neighbor, bagging decision tree, random forest, AdaBoost, and gradient boosting decision tree.
  • step S703 can be implemented through the following steps:
  • when step S7031 is implemented, the training set (X, y) is divided into 10 parts. For each base classifier, 9 parts are taken as the training set to train the base classifier, and the remaining part is input into the base classifier as the test set for prediction processing, obtaining the probability value of no service; this is repeated 10 times.
  • in this way, the original training set (X, y) is transformed via the k base classifiers into (X1, y), where X1 ∈ R^{n×k}.
  • step S7032 a new classifier is added, and the prediction results of the base classifier are integrated.
  • when step S7032 is implemented, the types of candidate new classifiers can be preset, the grid search method can be used to search for the optimal hyperparameters of each candidate new classifier, and ten-fold cross-validation under the optimal hyperparameters can be used to evaluate the performance of each candidate new classifier under the training set (X1, y).
  • after the performance parameters of each candidate new classifier are determined, they are compared with the performance parameters of each base classifier, and the candidate new classifier that outperforms each base classifier is determined as the integrated classifier.
  • then (X, y) is used to train each base classifier, (X1, y) is used to train the integrated classifier, and the trained base classifiers and integrated classifier are saved in sequence.
  • the dynamic characteristics of the mini program in the running process are obtained, so that the presence or absence of services is classified based on the dynamic characteristics of the mini program, which can avoid the static code of the mini program and the display of online pages. Misjudgments and omissions caused by the differences of single features and the limitations of single features according to rules, thereby improving the overall classification performance.
  • the applet classification device 344 provided by the embodiments of the present application is implemented as software modules. The software modules of the applet classification device 344 stored in the memory 340 may form the applet classification device in the server 300, including:
  • the first obtaining module 3441 is configured to obtain the applet code of the applet to be classified
  • the running module 3442 is configured to run the code of the applet, and obtain the dynamic characteristics of the applet to be classified in the running process;
  • the first determining module 3443 is configured to input the dynamic feature into the trained classifier model to obtain classification information of the applet to be classified;
  • the storage module 3444 is configured to store classification information of the applet to be classified.
  • the running module 3442 is further configured to:
  • Trigger each preset event in turn, obtain the successfully triggered target event, and obtain the application program interface information that is called when the target event is triggered and the control information corresponding to the target event;
  • based on the number of target events, the application program interface information, the control information, the first applet interface image and the second applet interface image, the dynamic characteristics of the applet to be classified in the running process are determined.
  • the dynamic feature is determined based on the number of the target event, the total number of calls of each application program interface, the number of interactive controls, and image difference information.
  • the trained classifier model includes at least K first classifier models that have been trained.
  • the first determining module 3443 is further configured to:
  • the classification information of the applet to be classified is determined.
  • the trained classifier model further includes a trained second classifier model.
  • the first determining module 3443 is further configured to:
  • the K initial predicted values are input to the second classifier model to perform integrated processing on the K initial predicted values to obtain the target predicted value.
  • the first determining module 3443 is further configured to:
  • when the target predicted value is greater than the classification threshold, the classification information of the applet to be classified is the first type of applet;
  • when the target predicted value is less than or equal to the classification threshold, the classification information of the applet to be classified is the second type of applet.
  • the device further includes:
  • the second acquisition module is configured to acquire a first training data set and preset M first candidate classifier models, wherein the first training data set includes the dynamic characteristics of the training applet and the label information of the training applet, M is an integer greater than 2;
  • a second determining module configured to determine performance parameters corresponding to the M first candidate classifier models based on the first training data set
  • the third determining module is configured to determine K first classifier models based on the performance parameters corresponding to the M first candidate classifier models;
  • the first training module is configured to train the K first classifier models by using the first training data set to obtain K trained first classifier models.
  • the device further includes:
  • the data construction module is configured to use the first training data set and the K first classifier models to construct a second training data set, where the second training data set includes: the K first classifier model pairs are trained The prediction information of the applet and the label information of the training applet;
  • the third acquisition module is configured to acquire N preset second candidate classifier models, and determine the performance parameters corresponding to the N second candidate classifier models based on the second training data set, where N is greater than 1.
  • the fourth determining module is configured to determine the second classifier model from the N second candidate classifier models based on the performance parameters corresponding to the N second candidate classifier models and the performance parameters corresponding to the K first classifier models;
  • the second training module is configured to train the second classifier model by using the second training data set to obtain the trained second classifier model.
  • the data building module is further configured as:
  • the prediction information of the K first classifier models for the training applet in the first to Pth test data sets and the label information of the training applet are determined as the second training data set.
  • the fourth determining module is further configured to:
  • the j-th second candidate classifier model is determined as the second classifier model, where j is an integer between 1 and N.
  • Q second target classifier models are determined, and the performance parameters corresponding to the second target classifier models All are better than the performance parameters corresponding to the K first classifier models;
  • a second classifier model is determined from the Q second target classifier models.
  • the embodiment of the present application provides a storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the method provided in the embodiments of the present application, for example, the method shown in FIG. 3.
  • the embodiment of the present application provides a computer program product or computer program, the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the applet classification method described in the embodiment of the present application.
  • the storage medium may be a computer-readable storage medium, for example, a Ferromagnetic Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a Compact Disk-Read Only Memory (CD-ROM); it may also be any device including one or any combination of the foregoing memories.
  • executable instructions may be in the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as an independent program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines, or code parts).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.


Abstract

一种小程序分类方法、装置、设备及计算机可读存储介质,其中,方法包括:获取待分类小程序的小程序代码(S101);运行该小程序代码,获取该待分类小程序在运行过程中的动态特征(S102);将该动态特征输入训练好的分类器模型,得到该待分类小程序的分类信息(S103);存储该待分类小程序的分类信息(S104)。

Description

小程序分类方法、装置、设备及计算机可读存储介质
相关申请的交叉引用
本申请基于申请号为202010583738.0、申请日为2020年06月23日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及小程序分类技术领域,涉及但不限于一种小程序分类方法、装置、设备及计算机可读存储介质。
背景技术
小程序,是一种介于传统H5网页和传统原生Android/IOS应用之间的应用形态。小程序不需要下载安装即可使用,相对于专用客户端节省了安装过程,实现了应用“触手可及”的梦想,因此拥有非常庞大的使用者和开发者群体。
目前小程序可以分为无服务小程序和有服务小程序,其中,无服务小程序是指只能展示一些基本信息,如企业介绍或简历展示等,而不提供其他实际服务的小程序。有服务小程序是指能提供预约服务、点餐服务、签到服务等实际服务的小程序。为了在用户搜索小程序时给用户展现能够提供实际服务的小程序,可以在小程序上架时对小程序进行分类,以识别出无服务小程序。
发明内容
本申请实施例提供一种小程序分类方法、装置、设备及计算机可读存储介质,通过小程序的动态特征对小程序进行分类,能够提高分类结果的准确率。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种小程序分类方法,所述方法应用于小程序分类设备,包括:
获取待分类小程序的小程序代码;
运行该小程序代码,获取该待分类小程序在运行过程中的动态特征;
将该动态特征输入训练好的分类器模型,得到该待分类小程序的分类信息;
存储该待分类小程序的分类信息。
本申请实施例提供一种小程序分类装置,包括:
第一获取模块,配置为获取待分类小程序的小程序代码;
运行模块,配置为运行该小程序代码,获取该待分类小程序在运行过程中的动态特征;
第一确定模块,配置为将该动态特征输入训练好的分类器模型,得到该待分类小程序的分类信息;
存储模块,配置为存储该待分类小程序的分类信息。
本申请实施例提供一种小程序分类设备,包括:
存储器,配置为存储可执行指令;处理器,配置为执行该存储器中存储的可执行指令时,实现上述的小程序分类方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行时,实现上述的小程序分类方法。
本申请实施例具有以下有益效果:
在获取到待分类小程序的小程序代码后,运行该小程序代码,获取该待分类小程序在运行过程中的动态特征,进而将该动态特征输入训练好的分类器模型,确定该待分类小程序的分类信息并存储该分类信息。由于动态特征是小程序运行过程中提取的,能够反映小程序在使用过程中的实际表现特征,因此利用小程序的动态特征对小程序进行分类,能够提高分类结果的准确率。
附图说明
图1为本申请实施例提供的小程序分类系统的一个网络架构示意图;
图2为本申请实施例提供的服务器300的结构示意图;
图3为本申请实施例提供的小程序分类方法的一种实现流程示意图;
图4A为本申请实施例提供的获取训练好的分类器模型的一种实现流程示意图;
图4B为本申请实施例提供的获取训练好的分类器模型的另一种实现流程示意图;
图5为本申请实施例提供的小程序分类方法的另一种实现流程示意图。
具体实施方式
为了使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请 实施例进行说明和描述,所描述的实施例不应视为对本申请实施例的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请实施例保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。除非另有定义,本申请实施例所使用的所有的技术和科学术语与属于本申请实施例的技术领域的技术人员通常理解的含义相同。本申请实施例所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
1)小程序,又可以称为网络应用,是由客户端(例如浏览器或内嵌浏览器核心的任意客户端)经由网络(如互联网)下载、并在客户端的浏览器环境中解释和执行的软件,是一种介于传统H5网页和传统原生Android/IOS应用之间的应用形态;例如,在社交网络客户端中可以下载、运行用于实现机票购买、乘车码等各种服务的网络应用。
2)准确率,评价分类模型或者机器学习模型性能的一种指标,用预测正确的结果占总样本的百分比来表示。
准确率的表达式为:准确率=(TP+TN)/(TP+TN+FP+FN);其中,TP为预测为1,实际为1,预测正确;FP为预测为1,实际为0,预测错误;FN为预测为0,实际为1,预测错误;TN为预测为0,实际为0,预测正确。
虽然准确率能够判断总的正确率,但是在样本不均衡的情况下,并不能作为很好的指标来衡量结果。
3)精确率,又可称为精确度,精确率(Precision)是针对预测结果而言的,其含义是在被所有预测为正的样本中实际为正样本的概率。
精确度的表达式为:精确率=TP/(TP+FP)。
4)召回率,是针对原样本而言的,其含义是在实际为正的样本中被预测为正样本的概率。
召回率的表达式为:召回率=TP/(TP+FN)。
5)F1分数(F1-Score),F1分数同时考虑精确率和召回率,让两者同时达到最高,取得平衡。
F1分数表达式为:F1分数=2*精确率*召回率/(精确率+召回率)。
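上述准确率、精确率、召回率和F1分数的计算可以用如下Python代码示意(仅为按上述公式给出的示例性实现,函数名与示例数据均为说明而设,并非本申请实施例的组成部分):

```python
def classification_metrics(y_true, y_pred):
    """根据真实标签与预测标签统计TP/TN/FP/FN,并计算准确率、精确率、召回率、F1分数。"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 预测为1,实际为1
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 预测为0,实际为0
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 预测为1,实际为0
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 预测为0,实际为1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 示例:4个样本中预测对3个(1个正样本被漏判)
acc, prec, rec, f1 = classification_metrics([1, 0, 1, 0], [1, 0, 0, 0])
```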
6)受试者工作特征曲线(ROC,Receiver Operating Characteristic)曲线,被用来评价一个二值分类器(binary classifier)的优劣。相比准确率、召回率、F-score这样的评价指标,ROC曲线有这样一个很好的特性:当测试集中正负样本的分布变化的时候,ROC曲线能够保持不变。
7)ROC曲线下面积(AUC,Area Under Curve),表示ROC中曲线下的面积,用于判断模型的优劣。如ROC曲线所示,连接对角线的面积刚好是0.5,对角线的含义也就是随机判断预测结果,正负样本覆盖应该都是50%。另外,ROC曲线越陡越好,所以理想值是1,即正方形。所以AUC的值一般是介于0.5和1之间的。
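AUC除了按几何面积理解外,还可以通过秩统计量(Mann-Whitney U统计量)直接由样本分数计算,等价于随机取一对正负样本时正样本分数更高的概率。下面是基于该思路的一个纯标准库Python示意实现(仅作说明用途):

```python
def auc_score(y_true, scores):
    """用秩统计量计算ROC曲线下面积AUC,并列分数采用平均秩。"""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # 按分数升序排列的样本下标
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # 找出与当前分数并列的一段,分配平均秩(秩从1开始)
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, t in zip(ranks, y_true) if t == 1]
    n_pos, n_neg = len(pos_ranks), len(y_true) - len(pos_ranks)
    # Mann-Whitney U统计量归一化后即为AUC
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

可以验证:当正样本分数全部高于负样本时AUC为1;当分类器完全随机时AUC趋近0.5。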
为了更好地理解本申请实施例中提供的小程序分类方法,首先对相关技术中的小程序分类方法进行说明:
相关技术中,在进行小程序分类时,采取的方法是基于静态统计特征和规则的方法,也即对小程序静态代码里的按键数进行统计,取出按键数小于指定值的作为无服务的小程序。
由于小程序静态代码和线上页面展示的差异、以及单特征按规则划分的局限性,该分类方法会导致较多的误判和漏过。
基于此,在本申请实施例中提出一种基于动态特征的小程序分类方法,在获取到小程序的源代码之后,执行源代码以运行小程序,从而获取小程序原始动态特征,提取并构造有效的动态特征,再基于小程序动态特征对小程序进行分类。
下面说明本申请实施例提供的小程序分类设备的示例性应用,本申请实施例提供的小程序分类设备可以实施为笔记本电脑,平板电脑,台式计算机,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、智能机器人等任意具有屏幕显示功能的终端,也可以实施为服务器。下面,将说明小程序分类设备实施为服务器时的示例性应用。
参见图1,图1为本申请实施例提供的小程序分类系统的一个网络架构示意图。如图1所示,该小程序分类系统中包括用户终端100、开发者终端200和服务器300。小程序的开发者在开发者终端200(例如电脑等用户终端)部署小程序的开发框架完成针对小程序的代码开发,小程序可以用于实现各种服务方提供的服务,例如,乘车码服务、快递服务和线上购物等,开发框架中提供有小程序的构建工具,以将小程序的项目中的代码封装成一个或多个能够在客户端的浏览器环境中运行的JavaScript文件,并上传到 服务器300,以请求评审并在服务器300评审通过后上架,服务器300可以是承载业务方的业务逻辑的服务器,例如,承载乘车服务提供方的乘车服务的后台服务器。在图1中示例性地,将小程序存储至第一客户端对应的服务器300中。在一些实施例中,服务器300也可以是专用的存储服务器,例如,内容分发网络(CDN,Content Delivery Network)中与用户的终端最短链路的节点。
服务器300在接收到小程序文件后,对小程序进行评审,并在评审通过后运行该小程序并依次触发各个事件,获取小程序运行过程中的动态特征,进而根据动态特征对小程序进行分类,在一些实施例中,分为无服务小程序和有服务小程序。当服务器300接收到用户终端100发送的小程序搜索请求后,基于搜索请求中携带的搜索关键字查询与搜索关键字匹配的小程序,并获取匹配出的各个小程序的分类信息,当匹配出的小程序的分类信息为无服务小程序时,可以如图1所示将这些无服务小程序过滤掉,并向用户终端100返回仅包括有服务小程序的搜索结果。或者,在一些实施例中,服务器300可以将这些无服务小程序排在有服务小程序之后,并将排序后的小程序返回给用户终端100。
在一些实施例中,用户终端100可以向服务器300发送访问小程序商店的访问请求,服务器300响应于该访问请求,确定用户终端100常用的小程序的类型,这里的类型可以是指游戏类、购物类、出行类等,并基于确定出的常用小程序类型,确定与用户常用小程序类型匹配的有服务小程序,并携带于访问响应中返回给用户终端100,便于用户使用。
参见图2,图2为本申请实施例提供的服务器300的结构示意图,图2所示的服务器300包括:至少一个处理器310、存储器340、至少一个网络接口320。服务器300中的各个组件通过总线系统330耦合在一起。可理解,总线系统330用于实现这些组件之间的连接通信。总线系统330除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统330。
处理器310可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器340可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器340可选地包括在物理位置上远离处理器310的一个或多个存储设备。存储器340包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器340旨在包括任意适合类型的存储器。在一些实施例中,存储器340能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统341,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块342,配置为经由一个或多个(有线或无线)网络接口320到达其他计算设备,示例性的网络接口320包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
输入处理模块343,配置为对一个或多个来自一个或多个输入装置332之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的装置可以采用软件方式实现,图2示出了存储在存储器340中的小程序分类装置344,该小程序分类装置344可以是服务器300中的小程序分类装置,其可以是程序和插件等形式的软件,包括以下软件模块:第一获取模块3441、运行模块3442、第一确定模块3443和存储模块3444,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或拆分。将在下文中说明各个模块的功能。
在另一些实施例中,本申请实施例提供的装置可以采用硬件方式实现,作为示例,本申请实施例提供的装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的小程序分类方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
下面将结合本申请实施例提供的服务器300的示例性应用和实施,说明本申请实施例提供的小程序分类方法。本申请实施例提供一种小程序分类方法,应用于服务器,参见图3,图3为本申请实施例提供的小程序分类方法的一种实现流程示意图,将结合图3示出的步骤进行说明。
步骤S101,获取待分类小程序的小程序代码。
这里,待分类小程序可以是由小程序开发者在开发完小程序代码后提交到服务器的。步骤S101在实现时,可以是服务器每间隔一段时间,获取一次待分类小程序,例如可以是每间隔12个小时,获取12个小时之前这一历史时刻至当前时刻这一段时间内接收到的待分类小程序。在一些实施例中,步骤S101在实现时,也可以是每当服务器接收到开发者终端提交的小程序,即确定为待分类小程序,并执行后续步骤,也就是说待分类小程序可以是实时获取的。
步骤S102,运行该小程序代码,获取该待分类小程序在运行过程中的动态特征。
步骤S102在实现时,可以是对小程序代码进行渲染,并动态运行,在小程序运行过程中触发各个预设的事件,并记录触发各事件时所调用的JS API,统计触发成功的事件所绑定的控件个数,对未开始触发事件以及触发完所有事件的小程序页面分别进行截图并进行base64编码,获取到的以上数据进行合并,并保存为json文件,进而再通过该json文件,提取出待分类小程序的动态特征。
在本申请实施例中,小程序的动态特征可以包括统计特征、API特征以及图像特征,其中,统计特征可以包括触发成功的事件数、API个数、可交互控件个数;API特征包括各个API被调用的总次数;图像特征可以包括未开始触发事件以及触发完所有事件的小程序页面截图之间的差异信息。
可交互控件可以是指小程序运行过程中,能够响应触控操作或点击操作而呈现页面跳转或者呈现交互页面的控件。
步骤S103,将该动态特征输入训练好的分类器模型,得到该待分类小程序的分类信息。
这里,训练好的分类器模型可以包括但不限于是逻辑回归、支持向量机、决策树、朴素贝叶斯、K近邻、baggingK近邻、bagging决策树、随机森林、adaboost、梯度提升决策树。在本申请实施例中,训练好的分类模型可以仅包括训练好的一个或多个基分类器,此时步骤S103在实现时,可以将动态特征输入到训练好的一个或多个基分类器中,分别对应得到一个或多个初始预测值,然后根据一个或多个初始预测值确定目标预测值。在实际应用中,当得到多个初始预测值后,可以将多个初始预测值进行均值计算,得到目标预测值。其中目标预测值为待分类小程序为无服务小程序的概率,最后根据目标预测值确定待分类小程序的分类信息。该分类信息可以为有服务小程序,还可以为无服务小程序。
在一些实施例中,训练好的分类器模型不仅可以包括训练好的多个基分类器,还可以包括一个训练好的集成分类器,此时步骤S103在实现时,可以将动态特征输入到训练好的多个基分类器中,分别对应得到多个初始预测值,然后将多个初始预测值输入至集成分类器中进行数据集成,得到目标预测值,最后根据目标预测值确定待分类小程序的分类信息。
步骤S104,存储该待分类小程序的分类信息。
这里,步骤S104在实现时,可以是存储待分类小程序的标识和分类信息的对应关系,还可以是将该分类信息作为待分类小程序的一个属性信息进行存储。
在本申请实施例提供的小程序分类方法中,在获取到待分类小程序的小程序代码后,运行该小程序代码,以获取该待分类小程序在运行过程中的动态特征,进而基于该动态特征和训练好的分类器模型,确定该待分类小程序的分类信息并存储该分类信息。由于动态特征是小程序运行过程中提取的,能够反映小程序在使用过程中实际能够触发的事件、调用的API等,因此利用小程序的动态特征对小程序进行分类,能够提高分类结果的召回率。
在一些实施例中,图3所示的步骤S102可以通过以下步骤实现:
步骤S1021,运行该小程序代码,获取第一小程序界面图像。
这里,步骤S1021在实现时可以渲染并运行小程序代码,此时可以在服务器的显示界面中渲染出小程序界面,获取当前的第一小程序界面图像,该第一小程序界面图像也即没有触发任何事件之前的界面图像。
步骤S1022,依次触发各个预设事件,获取触发成功的目标事件,并获取触发该目标事件时调用的应用程序接口信息和该目标事件对应的控件信息。
这里,该预设事件可以是对待分类小程序中各个控件的单击事件、双击事件、长按事件。步骤S1022在实现时,首先基于小程序界面图像识别该小程序界面图像中的各个控件,然后基于各个控件触发预设事件,获取触发成功的目标事件,并获取触发该目标事件时调用的应用程序接口信息,其中应用程序接口信息至少包括应用程序接口标识和该应用程序接口被调用的次数。在本申请实施例中还会获取目标事件对应的控件信息,其中,该目标事件是通过哪些控件触发成功的,那么该控件的属性信息也即为目标事件对应的控件信息。该控件信息至少可以包括控件标识。
在本申请实施例中,在成功触发一个事件后,小程序界面可能会发生变化,因此可以在成功触发一个事件后再次获取小程序界面图像,并识别当前的小程序界面图像中的控件,进而再针对当前小程序界面图像中的控件触发预设事件,并获取触发成功的目标事件以及触发该目标事件时调用的应用程序接口信息和该目标事件对应的控件信息。
步骤S1023,触发完该各个预设事件后,获取第二小程序界面图像。
这里,当触发完所有的预设事件后,进行屏幕截图,以截取第二小程序界面图像。
步骤S1024,基于目标事件的数量、该应用程序接口信息、控件信息、第一小程序界面图像和第二小程序界面图像确定该待分类小程序在运行过程中的动态特征。
这里,步骤S1024在实现时,可以基于该应用程序接口信息确定各个应用程序接口的调用总次数,并基于该控件信息确定可交互控件个数,以及确定该第一小程序界面图像和第二小程序界面图像之间的图像差异信息。其中,该图像差异信息可以是计算第一小程序界面图像和第二小程序界面图像之间的汉明距离得到的。
计算第一小程序界面图像和第二小程序界面图像之间的汉明距离在实现时可以是,首先对第一小程序界面图像和第二小程序界面图像进行尺寸变换,将第一小程序界面图像和第二小程序界面图像处理为大小一致,且尺寸较小的图像,例如可以缩小到8*8,或者16*16的尺寸大小,然后进行色彩简化处理,例如可以将缩小后的图片转换为64级灰度。也就是说,所有像素点总共只有64种颜色,然后计算灰度平均值,再将简化色彩后的第一小程序界面图像和简化色彩后的第二小程序界面图像中每个像素点的灰度与平均值进行比较,对于大于或等于平均值的像素点记为1;小于平均值的像素点记为0。进而再计算哈希值,得到第一小程序界面图像和第二小程序界面图像分别对应的哈希值,然后将两个哈希值进行异或计算,从而得到第一小程序界面图像和第二小程序界面图像之间的汉明距离。
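上述均值哈希与汉明距离的计算过程可以用如下Python代码示意(假设输入已经是缩放后的8*8灰度矩阵,且省略了64级灰度简化这一步;函数与示例数据均为说明而设):

```python
def average_hash(gray8x8):
    """对8*8灰度矩阵计算均值哈希:逐像素与灰度平均值比较,大于等于记1,小于记0。"""
    pixels = [v for row in gray8x8 for v in row]
    avg = sum(pixels) / len(pixels)
    return [1 if v >= avg else 0 for v in pixels]

def hamming_distance(hash_a, hash_b):
    """将两个哈希按位异或,统计不同位的个数,即为汉明距离。"""
    return sum(a ^ b for a, b in zip(hash_a, hash_b))

# 示意数据:触发事件前页面上半黑下半白,触发事件后页面全黑
img_before = [[0] * 8] * 4 + [[255] * 8] * 4
img_after = [[0] * 8] * 8
dist = hamming_distance(average_hash(img_before), average_hash(img_after))
```

汉明距离越大,说明事件触发前后页面变化越大,可作为图像差异信息供分类器使用。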
在得到各个应用程序接口的调用总次数之后,还可以确定该待分类小程序所调用的应用程序接口总数,在本申请实施例中,将目标事件的数量、各个应用程序接口的调用总次数、应用程序接口总数、可交互控件个数、图像差异信息确定为待分类小程序的动态特征。
在步骤S1021至步骤S1024所在的实施例中,通过运行小程序,来获取小程序的应用程序接口总数、可交互控件个数以及各个应用程序接口的调用总次数和可触发的目标事件的数量等动态特征,从而能够保证得到的特征能够真实反映小程序的实际情况,因此能够在利用动态特征确定小程序的分类信息时,提高分类结果的召回率。
在一些实施例中,该训练好的分类器模型至少包括训练好的K个第一分类器模型,对应地,步骤S103可以通过以下步骤实现:
步骤S1031,将该动态特征分别输入至K个第一分类器模型中,对应得到K个初始预测值。
其中K为正整数,也即K为大于或者等于1的整数。第一分类器模型可以包括但不限于逻辑回归、支持向量机、决策树、朴素贝叶斯、K近邻、baggingK近邻、bagging决策树、随机森林、adaboost、梯度提升决策树。将该待分类小程序的动态特征分别输入至训练好的K个第一分类器模型中,该K个第一分类器模型对待分类小程序进行预测处理,对应得到K个初始预测值。该初始预测值为该待分类小程序为无服务小程序的初始概率值,为0到1之间的实数。
步骤S1032,基于该K个初始预测值确定目标预测值。
这里,步骤S1032在实现时,当K为1时,那么直接将初始预测值确定为目标预测值。当K为大于1的整数时,可以对该K个初始预测值进行均值处理,得到目标预测值。其中,该均值处理可以是算术平均,也可以是加权平均。
在一些实施例中,当K为大于1的整数时,第一分类器模型可以是基分类器模型,该训练好的分类器模型还包括训练好的第二分类器模型,第二分类器模型为集成分类器模型,对应地,步骤S1032在实现时,可以是将该K个初始预测值输入至第二分类器模型,以对该K个初始预测值进行集成处理,得到目标预测值。
步骤S1033,基于该目标预测值和预设的分类阈值,确定该待分类小程序的分类信息。
这里,步骤S1033在实现时,可以通过判断目标预测值和该分类阈值的大小关系确定待分类小程序的分类信息,其中,当该目标预测值大于该分类阈值时,确定该待分类小程序的分类信息为第一类型小程序;当该目标预测值小于或者等于该分类阈值时,确定该待分类小程序的分类信息为第二类型小程序,第一类型小程序为无服务小程序,第二类型小程序为有服务小程序。
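步骤S1031至步骤S1033的整体流程可以用如下Python代码示意(其中均值处理采用算术平均,阈值0.5仅为假设值,均非对本申请实施例的限定):

```python
def classify_applet(initial_preds, threshold=0.5):
    """对K个初始预测值取算术平均得到目标预测值,再与分类阈值比较得到分类信息。

    initial_preds: K个第一分类器模型输出的、小程序为无服务小程序的初始概率值。
    threshold: 分类阈值(此处0.5为示例假设,实际可按精确率/召回率的权衡选取)。
    """
    target = sum(initial_preds) / len(initial_preds)  # 目标预测值
    # 目标预测值大于阈值判为第一类型(无服务)小程序,否则判为第二类型(有服务)小程序
    label = "无服务小程序" if target > threshold else "有服务小程序"
    return label, target

label, score = classify_applet([0.9, 0.7, 0.8])  # 三个基分类器的初始预测值
```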
在一些实施例中,在步骤S103之前,需要获取到训练好的分类器模型,在实际实现时,可以通过图4A所示的以下步骤获取训练好的分类器模型:
步骤S001,获取第一训练数据集和预设的M个第一候选分类器模型。
其中,该第一训练数据集中包括多个训练小程序的动态特征和该多个训练小程序的标签信息,该标签信息用于表征训练小程序是无服务小程序还是有服务小程序,例如当训练小程序为无服务小程序时,该标签信息为1,当训练小程序为有服务小程序时,该标签信息为0。
M为大于1的整数,且M大于或者等于K。第一候选分类器模型包括但不限于逻辑回归、支持向量机、决策树、朴素贝叶斯、K近邻、baggingK近邻、bagging决策树、随机森林、adaboost、梯度提升决策树。获取M个第一候选分类器模型在实现时,可以首先确定M个第一候选分类器模型的类型,然后通过网格搜索法搜索各第一候选分类器的最优超参数,从而获取到M个第一候选分类器模型。
步骤S002,基于该第一训练数据集确定该M个第一候选分类器模型对应的性能参数。
这里,步骤S002在实现时,可以是利用S折交叉验证方法(例如可以是十折交叉验证方法)确定该M个第一候选分类器模型在第一训练数据集下的性能参数。对于每个第一候选分类器模型,可以是确定一个或者多个性能参数。其中,性能参数包括但不限于准确率、精确度、召回率、F1-score、ROC、AUC。例如可以是对于每个第一候选分类器模型,确定其精确度和召回率这两个性能参数。
步骤S003,基于该M个第一候选分类器模型对应的性能参数,确定出K个第一分类器模型。
这里,如果在步骤S002中,对于每个第一候选分类器模型,确定的是一个性能参数时,那么基于该性能参数,确定性能最好的K个第一分类器模型。举例来说,对于每个第一候选分类器模型来说,确定的是精确度这一个性能参数,那么步骤S003在实现时,可以是将M个第一候选分类器模型的精确度进行排序,从而从M个第一候选分类器模型中选出精确度最高的K个第一分类器模型。
如果在步骤S002中,对于每个第一候选分类器模型,确定的是至少两个性能参数时,步骤S003在实现时,可以是从至少两个性能参数中选择一个最为关注的性能参数,再基于该最为关注的性能参数,从M个第一候选分类器模型中确定出K个第一分类器模型;还可以是从至少两个性能参数中选择多个最为关注的性能参数,然后将多个最为关注的性能参数进行算术平均或者加权平均,或者进行求和运算,从而从M个第一候选分类器模型中确定出K个第一分类器模型。
步骤S004,利用该第一训练数据集对该K个第一分类器模型进行训练,得到K个训练好的第一分类器模型。
这里,步骤S004在实现时,可以是将第一训练数据集中多个训练小程序的动态特征分别输入到K个第一分类器模型中,对应得到训练预测值,然后根据训练预测值与各个训练小程序的标签信息确定训练预测值与实际标签信息的差值,并分别据此对K个第一分类器模型的参数进行调整,直至达到预设的训练完成条件,从而得到K个训练好的第一分类器模型。其中训练完成条件可以是达到预设的训练次数,或者可以是训练预测值与实际标签信息的差值小于预设阈值。
通过上述的步骤S001至步骤S004中,在获取到第一训练数据集和M个第一候选分类器模型之后,可以首先基于M个第一候选分类器模型的性能参数从中选择出性能最优的K个第一分类器模型,然后基于第一训练数据集对K个第一分类模型进行训练,从而得到训练好的K个第一分类器模型,进而利用训练好的K个第一分类器模型对待分类小程序进行分类,以确定待分类小程序的分类信息。
在一些实施例中,如图4B所示,在步骤S003之后,还可以通过以下步骤得到训练好的第二分类器模型:
步骤S005,利用该第一训练数据集和该K个第一分类器模型,构建第二训练数据集。
这里,该第二训练数据集中包括:K个第一分类器模型对训练小程序的预测信息和该训练小程序的标签信息,该预测信息至少包括训练小程序为无服务小程序的预测概率值。
步骤S006,获取预设的N个第二候选分类器模型,并基于该第二训练数据集确定该N个第二候选分类器模型对应的性能参数。
这里,N为大于1的整数。获取N个第二候选分类器模型在实现时,可以首先确定N个第二候选分类器模型的类型,然后通过网格搜索法搜索各第二候选分类器的最优超参数,从而获取到N个第二候选分类器模型。
步骤S006在实现时,可以是利用S折交叉验证方法(例如可以是十折交叉验证方法)确定该N个第二候选分类器模型在第二训练数据集下的性能参数。对于每个第二候选分类器模型,可以是确定一个或者多个性能参数。其中,性能参数包括但不限于准确率、精确度、召回率、F1-score、ROC、AUC。
需要说明的是,在步骤S006中确定的性能参数类型与步骤S002中确定的性能参数类型是相同的。例如,在步骤S002中确定的是第一分类器模型对应的精确度和召回率,那么在步骤S006中确定的也是第二分类器模型的精确度和召回率。
步骤S007,基于该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数,从该N个第二候选分类器模型中确定出第二分类器模型。
这里,步骤S007至少有以下两种实现方式:
第一种实现方式、将N个第二候选分类器模型的性能参数依次和K个第一分类器模型的性能参数进行对比,一旦确定出一个第二候选分类器模型的性能参数均优于K个第一分类器模型,那么当这个第二候选分类器模型与K个第一分类器模型之间的性能差异值大于预设阈值时,就将该第二候选分类器模型确定为第二分类器模型。
第二种实现方式、将N个第二候选分类器模型中性能参数均优于K个第一分类器模型中,性能参数最优的第二候选分类器模型确定为第二分类器模型。
在一些实施例中,还可以是将N个第二候选分类器模型的性能参数依次和K个第一分类器模型的性能参数进行对比,一旦确定出一个第二候选分类器模型的性能参数均优于K个第一分类器模型,将该第二候选分类器模型确定为第二分类器模型。
步骤S008,利用第二训练数据集对该第二分类器模型进行训练,得到训练好的第二分类器模型。
这里,步骤S008在实现时,可以是将第二训练数据集中的K个第一分类器模型对训练小程序的预测信息输入至第二分类器模型中,得到第二分类器模型对训练小程序的训练预测值,然后根据训练预测值与各个训练小程序的标签信息确定训练预测值与实际标签信息的差值,并据此对第二分类器模型的参数进行调整,直至达到预设的训练完成条件,从而得到训练好的第二分类器模型。其中训练完成条件可以是达到预设的训练次数,或者可以是训练预测值与实际标签信息的差值小于预设阈值。
在一些实施例中,图4B所示的步骤S005可以通过以下步骤实现:
步骤S051,将该第一训练数据集划分为P个第一训练数据子集。
这里,P为大于1的整数。其中,P的取值是由步骤S002中采用的S折交叉验证方法决定的,其中P=S,也就是说,在步骤S002中采用的是十折交叉验证方法,那么在该步骤S051中将第一训练数据集划分为10个第一训练数据子集。
步骤S052,将第i个第一训练数据子集确定为第i测试数据集。
这里,i=1,2,…,P。
步骤S053,利用其他第一训练数据子集对该K个第一分类器模型进行训练,得到K个训练后的第一分类器模型。
这里,其他第一训练数据子集为除该第i个第一训练数据子集之外的(P-1)个第一训练数据子集。
步骤S054,利用该K个训练后的第一分类器模型对该第i测试数据集进行预测处理,得到该K个第一分类器模型对该第i测试数据集中训练小程序的预测信息。
这里,步骤S052至步骤S054循环执行P次,从而得到K个第一分类器模型对第1至第P测试数据集中训练小程序的预测信息。
步骤S055,将该K个第一分类器模型对第1至第P测试数据集中训练小程序的预测信息和该训练小程序的标签信息确定为第二训练数据集。
这里,每个训练小程序对应有K个训练预测值,该K个训练预测值表示K个第一分类器模型对训练小程序为无服务小程序的预测概率值。
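步骤S051至步骤S055描述的折外(out-of-fold)预测构造过程可以用如下Python代码示意(其中MeanClassifier仅为演示用的占位基分类器,折划分方式也仅为示例,均非本申请实施例的限定):

```python
class MeanClassifier:
    """演示用占位基分类器:预测值为训练集中正样本的比例。"""
    def fit(self, X, y):
        self.p = sum(y) / len(y)
        return self
    def predict_proba(self, X):
        return [self.p for _ in X]

def build_second_training_set(X, y, classifiers, P=5):
    """P折划分第一训练数据集,用各基分类器的折外预测构造第二训练数据集(X1, y)。

    每个样本的第i折预测由"未见过该样本"的模型给出,避免第二分类器在训练时过拟合。
    """
    n, K = len(X), len(classifiers)
    X1 = [[0.0] * K for _ in range(n)]
    fold = [i % P for i in range(n)]  # 简单的折划分方式(示意)
    for p in range(P):
        train_idx = [i for i in range(n) if fold[i] != p]
        test_idx = [i for i in range(n) if fold[i] == p]
        for k, clf in enumerate(classifiers):
            clf.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            preds = clf.predict_proba([X[i] for i in test_idx])
            for i, pr in zip(test_idx, preds):
                X1[i][k] = pr
    return X1, y  # X1的形状为n×K

# 示意用法:10个样本、2个占位基分类器、P=5折
X1, y_out = build_second_training_set([[0]] * 10, [1, 0] * 5,
                                      [MeanClassifier(), MeanClassifier()], P=5)
```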
在一些实施例中,图4B所示的步骤S007可以有以下两种实现方式:
第一种实现方式可以通过以下步骤实现:
步骤S071A,依次将该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数进行比较。
步骤S072A,当确定第j个第二候选分类器模型对应的性能参数均优于该K个第一分类器模型对应的性能参数时,确定该第j个第二候选分类器模型与该K个第一分类器模型之间的性能差异值。
步骤S073A,当该性能差异值大于预设的差异阈值时,将该第j个第二候选分类器模型确定为第二分类器模型。
其中,j为1到N之间的整数。当性能差异值小于或者等于该差异阈值时,继续比对第j+1个第二候选分类器模型与K个第一分类器模型之间的性能参数。
在第一种实现方式中,可以不用比对完所有的N个第二分类器模型,即可确定出第二分类器模型,但是在第一种实现方式中不能保证确定出的第二分类器模型为N个第二候选分类器模型中性能最优的。
第二种实现方式可以通过以下步骤实现:
步骤S071B,基于该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数,确定出Q个第二目标分类器模型。
这里,第二目标分类器模型对应的性能参数均优于该K个第一分类器模型对应的性能参数。
步骤S072B,分别确定该Q个第二目标分类器模型与该K个第一分类器模型之间的Q个性能差异值。
步骤S073B,基于该Q个性能差异值,从该Q个第二目标分类器模型中确定出第二分类器模型。
这里,步骤S073B在实现时,可以是基于Q个性能差异值,从Q个第二目标分类器模型中确定出性能最优的作为第二分类器模型。
在第二种实现方式中,首先从N个第二候选分类器模型中,确定出所有的性能均优于K个第一分类器模型的Q个第二目标分类器模型,然后再从Q个第二目标分类器模型中确定出性能最优的一个作为第二分类器模型。第二种实现方式相对于第一种实现方式来说,计算量更大,但是能够确定出性能最优的第二分类器模型。在实际应用中,可以根据自身的实际需求确定采用第一种实现方式还是采用第二种实现方式。
在实际应用过程中,小程序开发者通过开发者终端提供的开发环境完成小程序开发后,在小程序上架前发送至服务器进行评审,服务器在接收到小程序文件后,对小程序进行评审。并在评审通过后运行该小程序并依次触发各个事件,获取小程序运行过程中的动态特征,进而根据动态特征对小程序进行分类,并存储小程序的分类信息,小程序的分类信息可以是无服务小程序或有服务小程序。当服务器接收到用户终端发送的小程序搜索请求后,基于搜索请求中携带的搜索关键字查询与搜索关键字匹配的小程序,并获取匹配出的各个小程序的分类信息,当匹配出的小程序的分类信息为无服务小程序时,可以将这些无服务小程序过滤掉,并向用户终端返回仅包括有服务小程序的搜索结果。在一些实施例中,服务器可以是在小程序评审通过后就发布到小程序商店,在发布之后再基于小程序动态特征对小程序进行分类,并存储小程序的分类信息。
本申请实施例再提供一种小程序分类方法,应用于上述应用场景,图5为本申请实施例提供的小程序分类方法的另一种实现流程示意图,以下结合图5对各个步骤进行说明。
步骤S501,开发者终端获取小程序开发者编辑的小程序代码。
在本申请实施例中,开发者终端提供有小程序开发环境或者小程序开发平台,小程序开发者可以通过小程序开发环境或小程序开发平台进行小程序代码编辑,从而实现小程序开发。
步骤S502,开发者终端基于接收到的上传操作,将开发完成的小程序代码发送至服务器。
这里,步骤S502在实现时,可以是小程序开发者在完成小程序代码开发后,封装成一个或多个能够在客户端的浏览器环境中运行的JavaScript文件,并上传到服务器。
步骤S503,服务器对该小程序进行评审,当评审通过时,将小程序上架。
这里,服务器对小程序进行评审在实现时可以是审核小程序内容是否符合规则,比如是否涉及谣言、欺诈、赌博等违规内容,还可以审核小程序代码是否存在缺陷(Bug)、功能是否完整等等。在小程序评审通过后,将小程序发布到小程序商店或导航网站,也即用户可以搜索并使用该小程序。
步骤S504,服务器运行该小程序代码,获取该小程序在运行过程中的动态特征。
步骤S505,服务器将该动态特征分别输入至K个第一分类器模型中,对应得到K个初始预测值。
其中,K为大于1的整数。第一分类器模型对应其他实施例中的基分类器,初始预测值为该小程序为无服务小程序的概率值,为0到1之间的实数。
步骤S506,服务器将该K个初始预测值输入至第二分类器模型,以对该K个初始预测值进行集成处理,得到目标预测值。
这里,第二分类器模型对应其他实施例中的集成分类器,用于将K个初始预测值进行集成处理,从而得到最终的目标预测值。
步骤S507,服务器基于该目标预测值和预设的分类阈值,确定该小程序的分类信息。
这里,步骤S507在实现时,可以通过判断目标预测值和该分类阈值的大小关系确定待分类小程序的分类信息,其中,当该目标预测值大于该分类阈值时,确定该待分类小程序的分类信息为第一类型小程序;当该目标预测值小于或者等于该分类阈值时,确定该待分类小程序的分类信息为第二类型小程序,第一类型小程序为无服务小程序,第二类型小程序为有服务小程序。
步骤S508,服务器存储该小程序的分类信息。
这里,可以是存储该小程序的标识和分类信息的对应关系,还可以是将该分类信息作为小程序的一个属性信息进行存储。
在一些实施例中,上述步骤S504至步骤S508可以是在小程序评审通过后,发布小程序前执行的,也即在发布小程序前确定出小程序的分类信息,并将小程序的分类信息和小程序文件一并或先后发布到小程序商店。
步骤S509,用户终端响应于小程序搜索操作,获取搜索关键字。
步骤S510,用户终端向服务器发送搜索请求。
这里,该搜索请求中携带有搜索关键字。
步骤S511,服务器基于搜索请求中的搜索关键字进行搜索,得到第一搜索结果。
这里,服务器基于搜索关键字,从自身存储的小程序标识中确定与搜索关键字匹配的小程序标识,并将该小程序标识确定为第一搜索结果。
步骤S512,服务器获取第一搜索结果中各个小程序标识对应的分类信息。
步骤S513,服务器根据各个小程序标识对应的分类信息,从第一搜索结果中删除分类信息为无服务小程序的小程序标识,得到第二搜索结果。
步骤S514,服务器将第二搜索结果返回给用户终端。
在本申请实施例提供的小程序分类方法中,小程序开发者在完成小程序代码开发后,将小程序代码发送至服务器,由服务器进行评审,并在评审通过后将小程序上架,另外服务器还会运行小程序,并依次触发各个预设事件从而获取到小程序运行过程中的动态特征,进而根据小程序的动态特征和训练好的第一分类模型和第二分类模型确定小程序的分类信息,如此,在用户终端进行小程序搜索,向服务器发送搜索请求时,服务器在确定出与搜索关键字匹配的第一小程序后,再获取各个第一小程序的分类信息,以根据分类信息删除掉无服务小程序,并将删除掉无服务小程序的搜索结果返回给用户终端,从而保证用户最终得到的搜索结果都是能够提供服务的小程序。
在一些实施例中,在对小程序进行内容画像分析的过程中,需要对小程序进行分类,以确定小程序是无服务小程序还是有服务小程序。其中“无服务”指提供的服务是展示一些基本信息,如企业介绍或简历展示等,而无其他实际服务。
本申请实施例提供的小程序分类方法,包括以下步骤:
步骤S601,获取小程序原始动态特征。
小程序动态特征是相对于静态特征而言的。小程序的静态特征是指从小程序静态代码里所能挖掘得到的特征,比如控件数、文档对象模型(DOM,Document Object Model)、自定义组件等,然而写在静态代码里的元素并非一定会呈现或被调用,也因此小程序实际呈现给用户的页面和其静态代码并不能相互对应。而小程序的动态特征是指将小程序代码经渲染之后动态运行起来时所能得到的特征,比如有哪些可触发的事件、触发事件时调用的JS API、事件所绑定的控件等,这实际上就是模拟用户与页面的交互。因此,动态特征比起静态特征更能反应真实的用户交互体验。
除了从动态代码直接获取到的动态特征之外,判断有无服务时还应注意事件触发后用户所能看到的页面是否有改变,所以还需要对事件触发前后的页面进行截图,后续用于比对。
在本申请实施例中,在获取小程序原始动态特征时,对小程序代码进行渲染并动态运行,依次触发各事件,记录触发各事件时所调用的JS API,统计可触发事件所绑定的控件个数,对触发事件前后的小程序页面分别进行截图并进行base64编码,将以上所有特征合并,并存为json文件,将该json文件保存到数据库,例如可以是mysql数据库。
步骤S602,提取并构造有效的动态特征。
这里,提取的动态特征可以包括统计特征、API特征和图片特征,其中,统计特征又可以包括可触发事件个数、API个数、可交互控件个数;API特征可以包括:各个API被调用的总次数,例如可以包括获取系统信息API被调用的总次数、扫描二维码API被调用的总次数、显示消息提示框API被调用的总次数等等;图片特征可以包括未触发事件时的页面截图和触发全部事件后的页面截图。
步骤S603,基于小程序动态特征对有无服务进行分类。
这里,步骤S603在实现时,可以是将获取到的小程序动态特征输入到训练好的多个基分类模型,得到各个基分类模型对该小程序的各个预测值,其中,该预测值为该小程序是无服务小程序的概率值;然后再将各个预测值输入到训练好的集成分类器模型,以通过该集成分类器模型对各个预测值进行集成,得到最终的预测值,该最终的预测值同样为该小程序是无服务小程序的概率值。在得到最终的预测值后,将最终的预测值与预设的分类阈值进行比较,得到该小程序的分类信息,例如可以是若最终的预测值大于该分类阈值则判决为无服务小程序,若最终的预测值小于该分类阈值则判决为不是无服务小程序。
以下对本申请实施例中提取的动态特征进行说明,表1为本申请实施例从json文件中提取的小程序动态特征:
表1、本申请实施例从json文件中提取的特征
(表1在原文中以图片形式给出,其文本内容无法从图片中恢复)
在上述表1中示出了需要从json文件中提取的动态特征的特征类别、特征名以及变量名。
在本申请实施例中,基于图片特征,另外构造表2所示的以下特征:
表2、本申请实施例基于图片特征所构造的动态特征
(表2在原文中以图片形式给出,其文本内容无法从图片中恢复)
在本申请实施例中,步骤S602在实现时,保留表1中的统计特征和JS API特征以及表2中的Hash_diff,去掉表1中的Pic_0和Pic_1以及表2中的Hash_0和Hash_1,得到22维动态特征,该22维动态特征在步骤S603中作为输入信息,输入至分类器模型中,以确定小程序的分类信息。
在本申请实施例中,在步骤S603之前,需要首先通过以下步骤得到训练好的基分类器模型和集成分类器模型:
步骤S701,收集具有标签信息的数据,构造训练集(X,y)。
这里,标签信息用于表征小程序是无服务小程序还是有服务小程序,该标签信息可以是0或者1,当标签信息为1时表示该小程序为无服务小程序,当该标签信息为0时表示该小程序为有服务小程序。
具有标签信息的数据可以是小程序代码,步骤S701在实现时,在收集到具有标签信息的小程序代码后,运行并渲染该小程序代码,并获取该小程序的动态特征,进而根据小程序的标签信息和动态特征构建训练集(X,y)。其中,X∈R^{n×22}、y∈{0,1}^{n×1},n表示有n个样本,22为步骤S602中提取到的22维特征。若y_i为y的第i个元素,y_i=1代表第i个为正样本,也即是无服务小程序;y_i=0代表第i个为负样本,也即不是无服务小程序。
步骤S702,构建m个分类器。
这里,该m个分类器包括但不限于逻辑回归、支持向量机、决策树、朴素贝叶斯、K近邻、baggingK近邻、bagging决策树、随机森林、adaboost、梯度提升决策树。在本申请实施例中,可以采用网格搜索法搜索各分类器的最优超参数,并在最优超参数下用十折交叉验证评估各分类器在训练集(X,y)下的性能,性能指标包括但不限于准确率、精确度、召回率、F1-score、ROC AUC。
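上述通过网格搜索法搜索各分类器最优超参数的过程可以用如下Python代码示意(其中参数网格的内容与打分函数score_fn均为假设的示例;实际应在给定超参数下用十折交叉验证计算所关注的性能指标,如召回率):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """网格搜索示意:遍历超参数网格,返回score_fn得分最高的参数组合及其得分。

    param_grid: 形如 {参数名: 候选取值列表} 的字典。
    score_fn: 接收一组超参数、返回性能得分的函数(例如封装了交叉验证的评估)。
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# 示例:假设的打分函数,在C=1、max_depth=5处取得最高分
best, best_score = grid_search(
    {"C": [0.1, 1, 10], "max_depth": [3, 5]},
    lambda p: -abs(p["C"] - 1) - abs(p["max_depth"] - 5),
)
```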
步骤S703,从m个分类器中挑选最关注的指标下性能最好的k个分类器作为基分类器,并新增一个分类器对基分类器进行堆叠(stacking)集成。
这里,如果最关注的指标为召回率,那么从m个分类中挑选出召回率最高的k个分类器作为基分类器,并且新增一个分类器作为集成分类器,以对基分类器进行stacking集成。其中,新增的分类器的类型包括但不限于是逻辑回归、支持向量机、决策树、朴素贝叶斯、K近邻、baggingK近邻、bagging决策树、随机森林、adaboost、梯度提升决策树。
在一些实施例中,步骤S703可以通过以下步骤实现:
步骤S7031,将训练集(X,y)分成10份,在每个基分类器下,每次取其中9份作为训练集对基分类器进行训练,剩下的1份作为测试集输入到基分类器中进行预测处理,得到无服务的概率值,重复10次。
由此,原始的训练集(X,y)经由k个基分类器转换成(X_1,y),其中X_1∈R^{n×k}。
步骤S7032,新增一个分类器,对基分类器的预测结果进行集成。
这里,步骤S7032在实现时可以预先设定好候选新增分类器的类型,例如,候选新增分类器的类型可以有三个,分别为逻辑回归、支持向量机、决策树,然后再采用网格搜索法搜索各候选新增分类器的最优超参数,并在最优超参数下用十折交叉验证评估各候选新增分类器在训练集(X_1,y)下的性能。如果某一候选新增分类器的性能比其中至少一个基分类器的性能差,那么换另一个候选新增分类器重新尝试;如果某一候选新增分类器的性能比所有基分类器的性能都好,那么可以将该候选新增分类器确定为集成分类器,并采用(X,y)对各基分类器进行训练,采用(X_1,y)对集成分类器进行训练,并序列化保存训练好的各基分类器和集成分类器。
在一些实施例中,还可以是在确定出各个候选新增分类器的性能参数后,将各个候选分类器的性能参数与各个基分类器的性能参数进行对比,从各个候选新增分类器中确定出最优于各个基分类器的候选新增分类器作为集成分类器。
采用单特征加规则的方案,无服务小程序识别的精确度为95%、召回率为28.1%,虽然精确度很高,但会存在很大比例的漏过,不适用于低质过滤场景。采用本申请实施例提供的分类方法,无服务小程序识别的精确度为77%、召回率为84%,用18%的精确度换回56%的召回率,精确度和召回率均较高,且可以通过选择判决阈值平衡精确度和召回率,可应用于搜索低质过滤等场景。
在本申请实施例提供的小程序分类方法中,获取小程序在运行过程中的动态特征,从而基于小程序动态特征对有无服务进行分类,如此能够避免由于小程序静态代码和线上页面展示的差异以及单特征按规则划分的局限性而导致的误判和漏过,从而提高整体分类性能。
下面继续说明本申请实施例提供的小程序分类装置344实施为软件模块的示例性结构,在一些实施例中,如图2所示,存储在存储器340的小程序分类装置344中的软件模块可以是服务器300中的小程序分类装置,包括:
第一获取模块3441,配置为获取待分类小程序的小程序代码;
运行模块3442,配置为运行该小程序代码,获取该待分类小程序在运行过程中的动态特征;
第一确定模块3443,配置为将该动态特征输入训练好的分类器模型,得到该待分 类小程序的分类信息;
存储模块3444,配置为存储该待分类小程序的分类信息。
在一些实施例中,该运行模块3442,还配置为:
运行该小程序代码,获取第一小程序界面图像;
依次触发各个预设事件,获取触发成功的目标事件,并获取触发该目标事件时调用的应用程序接口信息和该目标事件对应的控件信息;
触发完该各个预设事件后,获取第二小程序界面图像;
基于目标事件的数量、该应用程序接口信息、控件信息、第一小程序界面图像和第二小程序界面图像确定该待分类小程序在运行过程中的动态特征。
在一些实施例中,该运行模块3442,还配置为:
基于该应用程序接口信息确定各个应用程序接口的调用总次数;
基于该控件信息确定可交互控件个数;
确定该第一小程序界面图像和第二小程序界面图像之间的图像差异信息;
基于该目标事件的数量、各个应用程序接口的调用总次数、可交互控件个数、图像差异信息确定该动态特征。
在一些实施例中,该训练好的分类器模型至少包括训练好的K个第一分类器模型,对应地,该第一确定模块3443,还配置为:
将该动态特征分别输入至K个第一分类器模型中,对应得到K个初始预测值,其中K为大于1的整数;
基于该K个初始预测值确定目标预测值;
基于该目标预测值和预设的分类阈值,确定该待分类小程序的分类信息。
在一些实施例中,该训练好的分类器模型还包括训练好的第二分类器模型,对应地,该第一确定模块3443,还配置为:
将该K个初始预测值输入至第二分类器模型,以对该K个初始预测值进行集成处理,得到目标预测值。
在一些实施例中,该第一确定模块3443,还配置为:
当该目标预测值大于该分类阈值时,确定该待分类小程序的分类信息为第一类型小程序;
当该目标预测值小于或者等于该分类阈值时,确定该待分类小程序的分类信息为第二类型小程序。
在一些实施例中,该装置还包括:
第二获取模块,配置为获取第一训练数据集和预设的M个第一候选分类器模型,其中,该第一训练数据集中包括训练小程序的动态特征和该训练小程序的标签信息,M为大于2的整数;
第二确定模块,配置为基于该第一训练数据集确定该M个第一候选分类器模型对应的性能参数;
第三确定模块,配置为基于该M个第一候选分类器模型对应的性能参数,确定出K个第一分类器模型;
第一训练模块,配置为利用该第一训练数据集对该K个第一分类器模型进行训练,得到K个训练好的第一分类器模型。
在一些实施例中,该装置还包括:
数据构建模块,配置为利用该第一训练数据集和该K个第一分类器模型,构建第二训练数据集,其中,该第二训练数据集中包括:该K个第一分类器模型对训练小程序的预测信息和该训练小程序的标签信息;
第三获取模块,配置为获取预设的N个第二候选分类器模型,并基于该第二训练数据集确定该N个第二候选分类器模型对应的性能参数,其中,N为大于1的整数;
第四确定模块,配置为基于该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数,从该N个第二候选分类器模型中确定出第二分类器模型;
第二训练模块,配置为利用第二训练数据集对该第二分类器模型进行训练,得到训练好的第二分类器模型。
在一些实施例中,该数据构建模块,还配置为:
将该第一训练数据集划分为P个第一训练数据子集,其中P为大于1的整数;
将第i个第一训练数据子集确定为第i测试数据集;i=1,2,…,P;
利用其他第一训练数据子集对该K个第一分类器模型进行训练,得到K个训练后的第一分类器模型;其中,该其他第一训练数据子集为除该第i个第一训练数据子集之外的P-1个第一训练数据子集;
利用该K个训练后的第一分类器模型对该第i测试数据集进行预测处理,得到该K个第一分类器模型对该第i测试数据集中训练小程序的预测信息;
将该K个第一分类器模型对第1至第P测试数据集中训练小程序的预测信息和该 训练小程序的标签信息确定为第二训练数据集。
在一些实施例中,该第四确定模块,还配置为:
依次将该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数进行比较;
当确定第j个第二候选分类器模型对应的性能参数均优于该K个第一分类器模型对应的性能参数时,确定该第j个第二候选分类器模型与该K个第一分类器模型之间的性能差异值;
当该性能差异值大于预设的差异阈值时,将该第j个第二候选分类器模型确定为第二分类器模型,其中,j为1到N之间的整数。
在一些实施例中,该第四确定模块,还配置为:
基于该N个第二候选分类器模型对应的性能参数和该K个第一分类器模型对应的性能参数,确定出Q个第二目标分类器模型,该第二目标分类器模型对应的性能参数均优于该K个第一分类器模型对应的性能参数;
分别确定该Q个第二目标分类器模型与该K个第一分类器模型之间的Q个性能差异值;
基于该Q个性能差异值,从该Q个第二目标分类器模型中确定出第二分类器模型。
需要说明的是,本申请实施例装置的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果,因此不做赘述。对于本装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
本申请实施例提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的方法,例如,如图3、图4A、图4B和图5示出的方法。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的小程序分类方法。
在一些实施例中,存储介质可以是计算机可读存储介质,例如,铁电存储器(FRAM,Ferromagnetic Random Access Memory)、只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read Only Memory)、带电可擦可编程只读存储器(EEPROM,Electrically Erasable Programmable Read Only Memory)、闪存、磁表面存储器、光盘、或光盘只读存储器(CD-ROM,Compact Disk-Read Only Memory)等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以被存储在保存其它程序或数据的文件的一部分中,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (14)

  1. 一种小程序分类方法,所述方法应用于小程序分类设备,包括:
    获取待分类小程序的小程序代码;
    运行所述小程序代码,获取所述待分类小程序在运行过程中的动态特征;
    将所述动态特征输入训练好的分类器模型,得到所述待分类小程序的分类信息;
    存储所述待分类小程序的分类信息。
  2. 根据权利要求1中所述的方法,其中,所述运行所述小程序代码,获取所述待分类小程序在运行过程中的动态特征,包括:
    运行所述小程序代码,获取第一小程序界面图像;
    依次触发各个预设事件,获取触发成功的目标事件,并获取触发所述目标事件时调用的应用程序接口信息和所述目标事件对应的控件信息;
    触发完所述各个预设事件后,获取第二小程序界面图像;
    基于目标事件的数量、所述应用程序接口信息、所述控件信息、所述第一小程序界面图像和所述第二小程序界面图像确定所述待分类小程序在运行过程中的动态特征。
  3. 根据权利要求2中所述的方法,其中,所述基于目标事件的数量、所述应用程序接口信息、所述控件信息、所述第一小程序界面图像和所述第二小程序界面图像确定所述待分类小程序在运行过程中的动态特征,包括:
    基于所述应用程序接口信息确定各个应用程序接口的调用总次数;
    基于所述控件信息确定可交互控件个数;
    确定所述第一小程序界面图像和所述第二小程序界面图像之间的图像差异信息;
    基于所述目标事件的数量、所述各个应用程序接口的调用总次数、所述可交互控件个数、所述图像差异信息确定所述动态特征。
  4. 根据权利要求1中所述的方法,其中,所述训练好的分类器模型至少包括训练好的K个第一分类器模型,对应地,所述将所述动态特征输入训练好的分类器模型,得到所述待分类小程序的分类信息,包括:
    将所述动态特征分别输入至K个第一分类器模型中,对应得到K个初始预测值,其中K为正整数;
    基于所述K个初始预测值确定目标预测值;
    基于所述目标预测值和预设的分类阈值,确定所述待分类小程序的分类信息。
  5. 根据权利要求4中所述的方法,其中,当K为大于1的整数时,所述训练好的分类器模型还包括训练好的第二分类器模型,对应地,所述基于所述K个初始预测值确定目标预测值,包括:
    将所述K个初始预测值输入至第二分类器模型,对所述K个初始预测值进行集成处理,得到目标预测值。
  6. 根据权利要求4中所述的方法,其中,所述基于所述目标预测值和预设的分类阈值,确定所述待分类小程序的分类信息,包括:
    当所述目标预测值大于所述分类阈值时,确定所述待分类小程序的分类信息为第一类型小程序;
    当所述目标预测值小于或者等于所述分类阈值时,确定所述待分类小程序的分类信息为第二类型小程序。
  7. 根据权利要求4中所述的方法,其中,所述方法还包括:
    获取第一训练数据集和预设的M个第一候选分类器模型,其中,所述第一训练数据集中包括训练小程序的动态特征和所述训练小程序的标签信息,M为大于1的整数;
    基于所述第一训练数据集确定所述M个第一候选分类器模型对应的性能参数;
    基于所述M个第一候选分类器模型对应的性能参数,确定K个第一分类器模型;
    利用所述第一训练数据集对所述K个第一分类器模型进行训练,得到K个训练好的第一分类器模型。
  8. 根据权利要求7中所述的方法,其中,所述方法还包括:
    利用所述第一训练数据集和所述K个第一分类器模型,构建第二训练数据集,其中,所述第二训练数据集包括:所述K个第一分类器模型对训练小程序的预测信息和所述训练小程序的标签信息;
    获取预设的N个第二候选分类器模型,并基于所述第二训练数据集确定所述N个第二候选分类器模型对应的性能参数,其中,N为大于1的整数;
    基于所述N个第二候选分类器模型对应的性能参数和所述K个第一分类器模型对应的性能参数,从所述N个第二候选分类器模型中确定出第二分类器模型;
    利用第二训练数据集对所述第二分类器模型进行训练,得到训练好的第二分类器模型。
  9. 根据权利要求8中所述的方法,其中,所述利用所述第一训练数据集和所述K个第一分类器模型,构建第二训练数据集,包括:
    将所述第一训练数据集划分为P个第一训练数据子集,其中P为大于1的整数;
    将第i个第一训练数据子集确定为第i测试数据集;i=1,2,...,P;
    利用其他第一训练数据子集对所述K个第一分类器模型进行训练,得到K个训练后的第一分类器模型;其中,所述其他第一训练数据子集为除所述第i个第一训练数据子集之外的P-1个第一训练数据子集;
    利用所述K个训练后的第一分类器模型对所述第i测试数据集进行预测处理,得到所述K个第一分类器模型对所述第i测试数据集中训练小程序的预测信息;
    将所述K个第一分类器模型对第1至第P测试数据集中训练小程序的预测信息和所述训练小程序的标签信息确定为第二训练数据集。
  10. 根据权利要求8中所述的方法,其中,所述基于所述N个第二候选分类器模型对应的性能参数和所述K个第一分类器模型对应的性能参数,从所述N个第二候选分类器模型中确定出第二分类器模型,包括:
    依次将所述N个第二候选分类器模型对应的性能参数和所述K个第一分类器模型对应的性能参数进行比较;
    当确定第j个第二候选分类器模型对应的性能参数均优于所述K个第一分类器模型对应的性能参数时,确定所述第j个第二候选分类器模型与所述K个第一分类器模型之间的性能差异值;
    当所述性能差异值大于预设的差异阈值时,将所述第j个第二候选分类器模型确定为第二分类器模型,其中,j=1,2,...,N。
  11. 根据权利要求8中所述的方法,其中,所述基于所述N个第二候选分类器模型对应的性能参数和所述K个第一分类器模型对应的性能参数,从所述N个第二候选分类器模型中确定出第二分类器模型,包括:
    基于所述N个第二候选分类器模型对应的性能参数和所述K个第一分类器模型对应的性能参数,确定Q个第二目标分类器模型,所述第二目标分类器模型对应的性能参数均优于所述K个第一分类器模型对应的性能参数,Q为小于或等于N的整数;
    分别确定所述Q个第二目标分类器模型与所述K个第一分类器模型之间的Q个性能差异值;
    基于所述Q个性能差异值,从所述Q个第二目标分类器模型中确定出第二分类器模型。
  12. 一种小程序分类装置,包括:
    第一获取模块,配置为获取待分类小程序的小程序代码;
    运行模块,配置为运行所述小程序代码,获取所述待分类小程序在运行过程中的动态特征;
    第一确定模块,配置为将所述动态特征输入训练好的分类器模型,得到所述待分类小程序的分类信息;
    存储模块,配置为存储所述待分类小程序的分类信息。
  13. 一种小程序分类设备,包括:
    存储器,配置为存储可执行指令;处理器,配置为执行所述存储器中存储的可执行指令时,实现权利要求1至11任一项所述的小程序分类方法。
  14. 一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行时,实现权利要求1至11任一项所述的小程序分类方法。
PCT/CN2021/096021 2020-06-23 2021-05-26 小程序分类方法、装置、设备及计算机可读存储介质 WO2021258968A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/732,382 US20220253307A1 (en) 2020-06-23 2022-04-28 Miniprogram classification method, apparatus, and device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010583738.0 2020-06-23
CN202010583738.0A CN113837210A (zh) 2020-06-23 2020-06-23 小程序分类方法、装置、设备及计算机可读存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/732,382 Continuation US20220253307A1 (en) 2020-06-23 2022-04-28 Miniprogram classification method, apparatus, and device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021258968A1 true WO2021258968A1 (zh) 2021-12-30

Family

ID=78964242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096021 WO2021258968A1 (zh) 2020-06-23 2021-05-26 小程序分类方法、装置、设备及计算机可读存储介质

Country Status (3)

Country Link
US (1) US20220253307A1 (zh)
CN (1) CN113837210A (zh)
WO (1) WO2021258968A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595391A (zh) * 2022-03-17 2022-06-07 北京百度网讯科技有限公司 基于信息搜索的数据处理方法、装置和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778266A (zh) * 2016-11-24 2017-05-31 天津大学 一种基于机器学习的安卓恶意软件动态检测方法
CN107729927A (zh) * 2017-09-30 2018-02-23 南京理工大学 一种基于lstm神经网络的手机应用分类方法
US20190222499A1 (en) * 2016-09-22 2019-07-18 Huawei Technologies Co., Ltd. Network Data Flow Classification Method and System
CN111222137A (zh) * 2018-11-26 2020-06-02 华为技术有限公司 一种程序分类模型训练方法、程序分类方法及装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818987A (zh) * 2022-06-20 2022-07-29 Shenzhen Research Institute of Sun Yat-sen University Method, apparatus, and system for processing scientific and technological service data
CN114818987B (zh) * 2022-06-20 2022-11-08 Shenzhen Research Institute of Sun Yat-sen University Method, apparatus, and system for processing scientific and technological service data

Also Published As

Publication number Publication date
CN113837210A (zh) 2021-12-24
US20220253307A1 (en) 2022-08-11

Similar Documents

Publication Publication Date Title
US10958748B2 (en) Resource push method and apparatus
CN109791642B (zh) Automatic generation of workflows
CN112632385A (zh) Course recommendation method and apparatus, computer device, and medium
US20220215296A1 (en) Feature effectiveness assessment method and apparatus, electronic device, and storage medium
US11416754B1 (en) Automated cloud data and technology solution delivery using machine learning and artificial intelligence modeling
WO2021258968A1 (zh) Miniprogram classification method, apparatus, and device, and computer-readable storage medium
CN112328909B (zh) Information recommendation method and apparatus, computer device, and medium
US20180357078A1 (en) Device with extensibility
US20210342743A1 (en) Model aggregation using model encapsulation of user-directed iterative machine learning
US20200334697A1 (en) Generating survey responses from unsolicited messages
CN112085087B (zh) Method and apparatus for generating service rules, computer device, and storage medium
CN112256537B (zh) Method and apparatus for displaying model running state, computer device, and storage medium
CN114840869A (zh) Data sensitivity identification method and apparatus based on a sensitivity identification model
CN113569115A (zh) Data classification method, apparatus, device, and computer-readable storage medium
WO2023050143A1 (zh) Recommendation model training method and apparatus
CN110442803A (zh) Data processing method performed by a computing device, apparatus, medium, and computing device
US20220083907A1 (en) Data generation and annotation for machine learning
US11567851B2 (en) Mathematical models of graphical user interfaces
CN113919361A (zh) Text classification method and apparatus
US11501071B2 (en) Word and image relationships in combined vector space
CN117522538A (zh) Method and apparatus for processing bidding information, computer device, and storage medium
CN114330353B (zh) Entity recognition method and apparatus for virtual scenes, device, medium, and program product
US20230140828A1 (en) Machine Learning Methods And Systems For Cataloging And Making Recommendations Based On Domain-Specific Knowledge
US20220309335A1 (en) Automated generation and integration of an optimized regular expression
Karthikeyan et al. Mobile Artificial Intelligence Projects: Develop seven projects on your smartphone using artificial intelligence and deep learning techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828189

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.05.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21828189

Country of ref document: EP

Kind code of ref document: A1