CN108334895A - Classification method, apparatus, storage medium and electronic device for target data - Google Patents

Info

Publication number
CN108334895A
Authority
CN
China
Prior art keywords
target
data
file
classification results
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711499178.5A
Other languages
Chinese (zh)
Other versions
CN108334895B (en)
Inventor
王世伟
韩萌
龙锦就
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711499178.5A priority Critical patent/CN108334895B/en
Publication of CN108334895A publication Critical patent/CN108334895A/en
Application granted granted Critical
Publication of CN108334895B publication Critical patent/CN108334895B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification method, apparatus, storage medium and electronic device for target data. The method includes: obtaining a model file, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data; extracting multiple sets of corresponding data features, classification conditions and classification results from the model file, converting them into multiple functions that conform to a target format, and generating a target script file; and classifying target data based on the target script file. The invention solves the technical problem in the related art that a classification model is highly complex to use when it handles large-scale data classification tasks.

Description

Classification method, apparatus, storage medium and electronic device for target data
Technical field
The present invention relates to the field of computers, and in particular to a classification method, apparatus, storage medium and electronic device for target data.
Background art
In existing approaches to data classification, a modeling tool is typically used to train a model on sample data. The model is then saved to the local computer as a local model file, and the prediction function in the model file is used to classify unclassified data. Although this approach solves part of the data classification problem, running the trained model file depends on the runtime environment that was built for model training. If the model file needs to be used on other devices, a complex runtime environment has to be rebuilt on each of them. As a result, the trained classification model is only suitable for classifying local data and cannot carry out large-scale classification tasks.
No effective solution to the above problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a classification method, apparatus, storage medium and electronic device for target data, so as to at least solve the technical problem in the related art that a classification model is highly complex to use when it handles large-scale data classification tasks.
According to one aspect of the embodiments of the present invention, a classification method for target data is provided, including: obtaining a model file, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data; extracting multiple sets of corresponding data features, classification conditions and classification results from the model file, converting them into multiple functions that conform to a target format, and generating a target script file; and classifying target data based on the target script file.
According to another aspect of the embodiments of the present invention, a classification apparatus for target data is also provided, including: an acquisition module, configured to obtain a model file, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data; a processing module, configured to extract multiple sets of corresponding data features, classification conditions and classification results from the model file, convert them into multiple functions that conform to a target format, and generate a target script file; and a classification module, configured to classify target data based on the target script file.
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium includes a stored program, where the program, when run, executes the method described in any of the above.
According to another aspect of the embodiments of the present invention, an electronic device is also provided, including a memory, a processor, and a computer program that is stored in the memory and can run on the processor, where the processor executes the method described in any of the above by means of the computer program.
In the embodiments of the present invention, a model file is obtained, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data; multiple sets of corresponding data features, classification conditions and classification results are extracted from the model file and converted into multiple functions that conform to a target format, and a target script file is generated; and target data is classified based on the target script file. In other words, the corresponding data features, classification conditions and classification results are extracted from the model file and the correspondences are converted into functions that conform to the target format, which yields the target script file. The classification conditions and classification results in the model file are thereby recorded as rule functions. When the classification model needs to be installed on multiple devices in order to classify large-scale data, it is enough to transfer the generated target script file to each device and classify the target data based on that file. This avoids rebuilding the complex runtime environment of the classification model, reduces the complexity of handling large-scale data classification tasks with the classification model, and thus overcomes the corresponding problem in the related art.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic diagram of an application environment of an optional classification method for target data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an optional classification method for target data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of classifying target data in a classification method for target data according to an optional embodiment of the present invention;
Fig. 4 is a schematic diagram of model parsing in a classification method for target data according to an optional embodiment of the present invention;
Fig. 5 is a schematic diagram of an optional classification apparatus for target data according to an embodiment of the present invention;
Fig. 6 is a first schematic diagram of an application scenario of an optional classification method for target data according to an embodiment of the present invention;
Fig. 7 is a second schematic diagram of an application scenario of an optional classification method for target data according to an embodiment of the present invention;
Fig. 8 is a third schematic diagram of an application scenario of an optional classification method for target data according to an embodiment of the present invention;
Fig. 9 is a fourth schematic diagram of an application scenario of an optional classification method for target data according to an embodiment of the present invention;
Fig. 10 is a fifth schematic diagram of an application scenario of an optional classification method for target data according to an embodiment of the present invention; and
Fig. 11 is a schematic diagram of an optional electronic device according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than the one illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units that are expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
In the embodiments of the present invention, an embodiment of the above classification method for target data is provided. As an optional implementation, the classification method may be, but is not limited to being, applied in the application environment shown in Fig. 1. Device 102 is connected to device 104 through network 106. Device 102 is configured to obtain a model file from device 104 through network 106, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data; to extract multiple sets of corresponding data features, classification conditions and classification results from the model file, convert them into multiple functions that conform to a target format, and generate a target script file; and to classify target data based on the target script file. Device 104 is configured to store the model file corresponding to the target model.
In this embodiment, device 102 extracts the corresponding data features, classification conditions and classification results from the model file stored on device 104 and converts the correspondences into functions that conform to the target format, which yields the target script file. The classification conditions and classification results in the model file are thereby recorded as rule functions. When the classification model needs to be installed on multiple devices in order to classify large-scale data, it is enough to transfer the generated target script file to each device and classify the target data based on that file. This avoids rebuilding the complex runtime environment of the classification model, reduces the complexity of handling large-scale data classification tasks with the classification model, and thus overcomes the corresponding problem in the related art.
Optionally, in this embodiment, the above devices may include, but are not limited to, at least one of the following: a mobile phone, a tablet computer, a laptop, a desktop PC, a digital TV, and other hardware devices that perform area sharing. The above network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network. The above is only an example, and this embodiment places no limitation on it.
According to an embodiment of the present invention, a classification method for target data is provided. As shown in Fig. 2, the method includes:
S202: obtain a model file, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data;
S204: extract multiple sets of corresponding data features, classification conditions and classification results from the model file, convert them into multiple functions that conform to a target format, and generate a target script file;
S206: classify target data based on the target script file.
Optionally, in this embodiment, the above classification method for target data may be, but is not limited to being, applied in scenarios where data classification tasks are performed. The above device may be, but is not limited to, a server for various types of software, for example search engine software, news ocr software, instant messaging applications, shopping platform software, game software and so on. Specifically, the method may be, but is not limited to being, applied to classifying web page resource data in the above search engine software, or to classifying multimedia resources in the above news ocr software, so as to reduce the complexity of handling large-scale data classification tasks with a classification model. The above is only an example, and this embodiment places no limitation on it.
Optionally, in this embodiment, the target data may include, but is not limited to, multimedia data (video data, audio data), image data, text data, web page data and so on.
Optionally, in this embodiment, in step S204 above, model parsing may be, but is not limited to being, used to express all relationships in the model file between the model parameters, initial conditions and other input information on the one hand and the simulation time and results on the other as formulas, equations and inequalities.
Optionally, in this embodiment, the model file may be, but is not limited to, the model file of any of various classification models. For example, the model file may include, but is not limited to, an xgboost model file. The target script file may be, but is not limited to, a script file written in any of various programming languages. For example, the target script file may include, but is not limited to, a python script file. The target script file may also be, but is not limited to, an R script file, a matlab script file, a C++ script file and so on.
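To illustrate what "multiple functions that conform to a target format" could look like in a python script file, the following is a minimal sketch for a model with two trees. The names `rule_0`, `rule_1` and `RULES`, the feature names, and every split value and leaf score are invented for illustration; the patent does not specify them.

```python
# Hypothetical target script: each function encodes one tree (one branch of
# the model) as plain conditionals, so running it needs only Python itself.

def rule_0(features):
    # Classification condition on the data feature "brightness".
    if features["brightness"] < 0.5:
        return 0.3   # classification result (leaf score)
    return -0.2


def rule_1(features):
    if features["contrast"] < 0.7:
        if features["saturation"] < 0.1:
            return 0.15
        return -0.05
    return 0.25


# All rule functions, in model order.
RULES = [rule_0, rule_1]

if __name__ == "__main__":
    sample = {"brightness": 0.4, "contrast": 0.9, "saturation": 0.3}
    print([rule(sample) for rule in RULES])
```

Because a file like this contains only conditionals and constants, copying it to another device is enough to reproduce the model's decisions there, which is the point the description makes about avoiding the training environment.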
Optionally, in this embodiment, in step S206 above, distributed stream processing may be performed on the target data based on the target script file. Distributed stream processing is a real-time computing approach for streaming data that is distributed, high-throughput, low-latency and fault-tolerant; it is a technique that processes data continuously according to a set of processing rules. The target script file is called on each of several computers that run in parallel. These computers read the streaming data at the same time and load it into the target script file that each of them calls in order to classify the data, which achieves parallel processing of large-scale streaming data.
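The parallel reading and classification described above can be approximated on one machine. The sketch below is an assumption-laden stand-in, not the patent's deployment: worker threads play the role of the parallel computers, `score` stands in for the rule functions of a target script, and all names and numbers are invented. Python threads do not speed up CPU-bound work, so this only illustrates the data flow.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import math


def score(features):
    # Stand-in for the rule functions of a target script (values invented).
    branch_a = 0.3 if features["x"] < 0.5 else -0.2
    branch_b = 0.1 if features["y"] < 1.0 else -0.4
    return 1.0 / (1.0 + math.exp(-(branch_a + branch_b)))


def record_stream(n):
    # Stand-in for a data stream; yields one record at a time.
    for i in range(n):
        yield {"id": i, "x": (i % 10) / 10.0, "y": (i % 4) / 2.0}


def batches(stream, size):
    # Cut the stream into small batches, reading only what is needed.
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def classify_batch(batch):
    return [(rec["id"], score(rec)) for rec in batch]


def classify_stream(stream, workers=4, batch_size=8):
    # Each worker plays the role of one machine that calls the script.
    # Only `workers` batches are pulled from the stream at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        it = batches(stream, batch_size)
        while True:
            window = list(islice(it, workers))
            if not window:
                return
            for result in pool.map(classify_batch, window):
                yield from result
```

Results come back in stream order because `Executor.map` preserves the order of its inputs. A real deployment would replace the thread pool with a stream-processing framework, which the patent does not name.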
It can be seen that, through the above steps, the corresponding data features, classification conditions and classification results are extracted from the model file and the correspondences are converted into functions that conform to the target format, which yields the target script file. The classification conditions and classification results in the model file are thereby recorded as rule functions. When the classification model needs to be installed on multiple devices in order to classify large-scale data, it is enough to transfer the generated target script file to each device and classify the target data based on that file. This avoids rebuilding the complex runtime environment of the classification model, reduces the complexity of handling large-scale data classification tasks with the classification model, and thus overcomes the corresponding problem in the related art.
As an optional solution, classifying the target data based on the target script file includes:
S1: extract target data features from the target data;
S2: call the target script file and input the target data features into it to obtain multiple target classification results;
S3: perform a target operation on the multiple target classification results to obtain a target operation result;
S4: determine, among multiple threshold ranges, the target threshold range into which the target operation result falls;
S5: determine the target category label that corresponds to the target threshold range as the label of the target data, where the multiple threshold ranges correspond one-to-one to multiple category labels.
Optionally, in this embodiment, the target data features extracted from the target data are consistent with the data features that the target model used during training. Whatever types of features were used to train the target model, the same types of target data features are extracted from the target data when it is classified. Take the classification of image data as an example. During training, the color, brightness, contrast and saturation of the image data are extracted as its data features, and the target model is obtained by training on these features. The model file of the target model is converted into a target script file. When the target script file is used to classify target data, the color, brightness, contrast and saturation are extracted from the target data as its target data features and input into the target script file to obtain the target classification results.
Optionally, in this embodiment, because the target script file records multiple sets of corresponding data features, classification conditions and classification results, inputting the target data features into the target script file makes it possible to find, among these correspondences, the classification conditions and classification results that relate to the target data features. The classification results that correspond to the classification conditions satisfied by the target data features are then taken, which gives the multiple target classification results.
Optionally, in this embodiment, different main functions may be designed to call target script files that are written in different script languages. The main function reads the target data and executes the target script file to obtain the multiple target classification results.
Optionally, in this embodiment, multiple threshold ranges are set in advance, and a one-to-one correspondence between the threshold ranges and multiple category labels is determined. After the target operation result is obtained, the target threshold range into which it falls is determined, the target category label that corresponds to that range is found from the one-to-one correspondence, and that label is used as the label of the target data, which completes the classification of the target data.
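For a python target script, a main function of this kind could be sketched as follows. The script contents, the module name `target_script`, the `RULES` list and the feature names `f0` and `f1` are all hypothetical; the script is written to a temporary file only to stand in for a file that was transferred to the device.

```python
import importlib.util
import os
import tempfile

# Hypothetical contents of a generated target script.
SCRIPT_SOURCE = '''
def rule_0(features):
    return 0.3 if features["f0"] < 0.5 else -0.2

def rule_1(features):
    return 0.1 if features["f1"] < 2.0 else -0.1

RULES = [rule_0, rule_1]
'''


def load_target_script(path):
    # Import the script file as a module, given only its path.
    spec = importlib.util.spec_from_file_location("target_script", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def main(records):
    # Write the script to a temporary file to stand in for a transferred one.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "target_script.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(SCRIPT_SOURCE)
        script = load_target_script(path)
        # One list of target classification results per record.
        return [[rule(rec) for rule in script.RULES] for rec in records]
```

A target script in another language would need a different caller, for example a subprocess invocation, which is why the description speaks of designing a separate main function per script language.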
As an optional solution, performing the target operation on the multiple target classification results to obtain the target operation result includes:
S1: sum the multiple target classification results to obtain a summed result;
S2: convert the summed result into a target probability value using the sigmoid function, and determine the target probability value as the target operation result.
Optionally, in this embodiment, after the multiple target classification results are obtained from the branches in the target script file that correspond to the target data features, the classification results are summed, and the summed result is mapped into the interval from 0 to 1 by the sigmoid function to obtain the target probability value. This target probability value is used as the target operation result for classifying the target data.
Optionally, in this embodiment, the interval from 0 to 1 is divided into the above multiple threshold ranges, and each threshold range corresponds to one category label. After the above target operation result is obtained, the target threshold range that corresponds to it is found, and the target category label that corresponds to the target threshold range is determined as the label of the target data.
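The summation, the sigmoid mapping sigmoid(x) = 1 / (1 + e^(-x)) and the lookup of the threshold range can be sketched as below. The boundaries 0.3 and 0.7 and the labels "low", "medium" and "high" are invented for illustration; the patent does not give concrete ranges.

```python
import math
from bisect import bisect_right

# Hypothetical partition of the 0-to-1 interval:
# [0, 0.3) -> "low", [0.3, 0.7) -> "medium", [0.7, 1] -> "high".
BOUNDARIES = [0.3, 0.7]
LABELS = ["low", "medium", "high"]


def sigmoid(x):
    # Numerically stable logistic function; avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def label_for(branch_results):
    total = sum(branch_results)                    # summation operation
    probability = sigmoid(total)                   # target probability value
    index = bisect_right(BOUNDARIES, probability)  # target threshold range
    return probability, LABELS[index]
```

For example, branch results that sum to 0 give a probability of 0.5, which falls into the middle range. Using `bisect_right` on sorted boundaries keeps the lookup correct however many ranges the interval is divided into.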
In an optional embodiment, as shown in Fig. 3, the detailed process of classifying target data based on the target script file includes the following steps:
Step S302: load the rule functions. The multiple functions that conform to the target format and are stored in the target script file are equivalent to rule functions. The designed main function calls the target script file and loads the rule functions in it.
Step S304: read the target data as a stream. When large-scale data is processed, the target data to be classified can be read in the form of streaming data.
Step S306: parse the target data into a fixed format. The target data features that meet the format requirements are extracted from the target data.
Step S308: load the parsed data and call the rule functions in a distributed manner. The rule functions that correspond to the target data features are called, and multiple target classification results are obtained according to the classification conditions in the rule functions.
Step S310: obtain the result of each branch. Each rule function that corresponds to the above target data features is regarded as one branch, and the multiple target classification results are obtained as the branch results.
Step S312: sum the branch results and map the summed result to a target probability value with the sigmoid function.
Step S314: judge whether the target data meets a preset screening condition, for example whether the target probability value is greater than a threshold.
Step S316: obtain the target data that meets the screening condition.
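The flow of Fig. 3 can be traced in a single-process sketch. It assumes, purely for illustration, that each record arrives as a text line of the form `id,f0,f1` and that the target script defines the two rule functions shown; none of these details come from the patent, and the distributed call of step S308 is reduced to a local loop.

```python
import io
import math

# Step S302: rule functions, as a target script might define them (invented).
def rule_0(f):
    return 0.8 if f["f0"] < 0.5 else -0.6

def rule_1(f):
    return 0.4 if f["f1"] < 10.0 else -0.9

RULES = [rule_0, rule_1]


def read_stream(handle):
    # Step S304: read records lazily, one line at a time.
    for line in handle:
        line = line.strip()
        if line:
            yield line


def parse(line):
    # Step S306: parse "id,f0,f1" into a fixed feature format.
    record_id, f0, f1 = line.split(",")
    return record_id, {"f0": float(f0), "f1": float(f1)}


def screen(handle, threshold=0.5):
    for line in read_stream(handle):
        record_id, features = parse(line)
        branches = [rule(features) for rule in RULES]         # S308, S310
        probability = 1.0 / (1.0 + math.exp(-sum(branches)))  # S312
        if probability > threshold:                           # S314
            yield record_id, probability                      # S316
```

As a usage example, `list(screen(io.StringIO("a,0.1,5\nb,0.9,20\n")))` keeps only record `a`, because its branch results sum to a positive value while those of `b` sum to a negative one.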
As an optional solution, extracting multiple sets of corresponding data features, classification conditions and classification results from the model file, converting them into multiple functions that conform to the target format, and generating the target script file includes:
S1: search the model file for classification results;
S2: extract from the model file the classification conditions that correspond to the classification results found, together with the data features that the classification conditions contain;
S3: establish the correspondence between the data features, the classification conditions and the classification results to obtain corresponding data features, classification conditions and classification results;
S4: convert the corresponding data features, classification conditions and classification results into a script file in the target programming language to obtain the target script file.
In an optional embodiment, the parsing of the above model file is illustrated with an xgboost model as the target model and python as the target programming language. As shown in Fig. 4, the target script file is generated as follows. First, the model file is read and each booster in the file is stored in its own list. The stored lists are then read in turn, and the content of each list is taken item by item. For each item, it is judged whether a "leaf" has been encountered. If not, the judgment condition of the branch, the branch serial number and similar content are saved. If a "leaf" has been encountered, the saved content is parsed into the rule function of that branch. This continues until all content in the list has been taken. It is then judged whether all lists have been read. If not, the next list is read and the above process is repeated. If so, the target script file is generated from the parsed rule functions.
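The parsing loop can be sketched as follows. This is a simplified variant rather than the exact procedure of Fig. 4: it emits one nested function per booster instead of one rule per root-to-leaf path, it ignores the `missing` direction (so it assumes inputs without missing values), and the sample `DUMP` is hand-written in the layout of XGBoost's plain-text model dump. The names `parse_dump`, `emit` and `generate_script` are hypothetical.

```python
import re

DUMP = """booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
\t1:leaf=0.3
\t2:[f1<2] yes=3,no=4,missing=3
\t\t3:leaf=0.1
\t\t4:leaf=-0.2
booster[1]:
0:leaf=0.05
"""

SPLIT = re.compile(r"(\d+):\[(\w+)<([^\]]+)\] yes=(\d+),no=(\d+)")
LEAF = re.compile(r"(\d+):leaf=([^,\s]+)")


def parse_dump(text):
    # Store each booster's nodes in its own dict, keyed by node id.
    trees = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("booster["):
            trees.append({})
            continue
        split, leaf = SPLIT.match(line), LEAF.match(line)
        if split:
            nid, feat, thr, yes, no = split.groups()
            trees[-1][int(nid)] = ("split", feat, float(thr), int(yes), int(no))
        elif leaf:
            trees[-1][int(leaf.group(1))] = ("leaf", float(leaf.group(2)))
    return trees


def emit(nodes, nid, depth):
    # Turn the subtree rooted at `nid` into indented Python source lines.
    pad = "    " * depth
    node = nodes[nid]
    if node[0] == "leaf":
        return [f"{pad}return {node[1]!r}"]
    _, feat, thr, yes, no = node
    lines = [f"{pad}if features[{feat!r}] < {thr!r}:"]
    lines += emit(nodes, yes, depth + 1)
    lines += emit(nodes, no, depth)
    return lines


def generate_script(text):
    trees = parse_dump(text)
    parts = []
    for i, nodes in enumerate(trees):
        parts.append(f"def rule_{i}(features):")
        parts.extend(emit(nodes, 0, 1))
        parts.append("")
    names = ", ".join(f"rule_{i}" for i in range(len(trees)))
    parts.append(f"RULES = [{names}]")
    return "\n".join(parts) + "\n"
```

Writing the string returned by `generate_script(DUMP)` to a `.py` file gives a script of the kind the description calls the target script file; it depends on nothing but the Python interpreter.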
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
According to an embodiment of the present invention, a classification apparatus for target data that implements the above classification method for target data is also provided. As shown in Fig. 5, the apparatus includes:
1) an acquisition module 52, configured to obtain a model file, where the model file is a file that stores a target model, and the target model is a model for performing a classification task that is obtained by training on sample data;
2) a processing module 54, configured to extract multiple sets of corresponding data features, classification conditions and classification results from the model file, convert them into multiple functions that conform to a target format, and generate a target script file;
3) a classification module 56, configured to classify target data based on the target script file.
Optionally, in the present embodiment, the sorter of above-mentioned target data can be, but not limited to be applied to execute data In the scene of classification task.Wherein, above equipment can be, but not limited to as the server of various types of softwares, for example, search Engine software, news ocr software, instant message applications, shopping platform software, Games Software etc..Specifically, can with but it is unlimited In applied in above-mentioned search engine software execute web page resources data classification task scene in, or can with but it is unlimited In the scene applied to the classification task for executing multimedia resource in above-mentioned news ocr software, to reduce at disaggregated model Manage complexity when large scale data classification task.Above-mentioned is only a kind of example, and any restriction is not done to this in the present embodiment.
Optionally, in the present embodiment, target data can be, but not limited to include multi-medium data (video data, audio Data), image data, text data, web data etc..
Optionally, in the present embodiment, in above-mentioned steps S204, can be, but not limited to by the way of model analyzing will All relationships between model parameter, primary condition and other input informations and simulated time and result in model file are equal It is indicated with formula, equation and inequality.
Optionally, in the present embodiment, model file can be, but not limited to be the corresponding model file of various disaggregated models. Such as:Model file can be, but not limited to include xgboost model files.Target script file can be, but not limited to be various machines The script file that device language is write, such as:Target script file can be, but not limited to include python script files.Target script File can with but be not limited to R language scripts file, matlab script files, C++ script files etc..
Optionally, in the present embodiment, target script file can be based on and distributed stream processing is carried out to target data.Point Cloth stream process be for stream data it is a kind of it is distributed, high handle up, low latency, with itself fault-tolerant real-time calculating side Formula is a kind of technology continuing processing according to one group of processing rule.Being called respectively on the computer that more run parallel should Target script file, the more computers run parallel while reading stream data and is loaded into the target script file respectively called In classification processing is carried out to data, to realize the parallel processing of extensive stream data.
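The parallel invocation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: threads in one process stand in for the separate computers, and `classify`, `worker`, and `classify_partitions` are hypothetical names, not part of the disclosed embodiment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def classify(record: float) -> str:
    """Hypothetical stand-in for the rule functions in the target script file."""
    return "positive" if record > 0.5 else "negative"


def worker(partition: Iterable[float]) -> List[str]:
    """Each worker calls its own copy of the classification logic on its share of the stream."""
    return [classify(record) for record in partition]


def classify_partitions(partitions: List[List[float]]) -> List[List[str]]:
    """Classify every partition concurrently; results keep the order of the partitions."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(worker, partitions))
```

Because each worker only needs the script's functions and its own slice of the data, no shared model runtime has to be installed or coordinated between workers.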
It can be seen that, with the above apparatus, multiple data features, classification conditions, and classification results that have a correspondence are extracted from the model file, and the correspondence is converted into functions conforming to a target format to obtain the target script file, so that the classification conditions and classification results in the model file are recorded as rule functions. When the classification model needs to be deployed on multiple devices to classify large-scale data, only the generated target script file needs to be transferred to each device, and the target data is classified based on that file. This avoids rebuilding the complex runtime environment of the classification model, which reduces the complexity of handling large-scale data classification tasks and thereby overcomes the problem in the related art that this complexity is high.
As an optional solution, the classification module includes:
1) a first extraction unit, configured to extract target data features from the target data;
2) a processing unit, configured to call the target script file and input the target data features into the target script file to obtain multiple target classification results;
3) an execution unit, configured to perform a target operation on the multiple target classification results to obtain a target operation result;
4) a first determination unit, configured to determine, among multiple threshold ranges, the target threshold range into which the target operation result falls;
5) a second determination unit, configured to determine the target class label corresponding to the target threshold range as the label of the target data, where the multiple threshold ranges correspond one-to-one with multiple class labels.
Optionally, in this embodiment, the target data features extracted from the target data are consistent with the data features used to train the target model. Whatever types of features were used to train the target model, the same types of target data features are extracted from the target data during classification. Take the classification of image data as an example: during training of the target model, the color, brightness, contrast, and saturation of the image data are extracted as its data features, and the target model is obtained by training on those features. The model file corresponding to the target model is converted into the target script file. When the target script file is used to classify target data, the color, brightness, contrast, and saturation are extracted from the target data as the target data features and input into the target script file to obtain the target classification results.
Optionally, in this embodiment, because the target script file records multiple data features, classification conditions, and classification results that have a correspondence, inputting the target data features into the target script file makes it possible to obtain, from the multiple correspondences, the classification conditions and classification results corresponding to the target data features. The classification results corresponding to the classification conditions that the target data features satisfy are thus obtained, yielding multiple target classification results.
Optionally, in this embodiment, a different main function may be designed to call the target script file for each script language used to produce it. The main function reads the target data and executes the target script file to obtain the multiple target classification results.
Optionally, in this embodiment, multiple threshold ranges are preset, and a one-to-one correspondence between the threshold ranges and multiple class labels is determined. After the target operation result is obtained, the target threshold range into which it falls is determined, and the target class label corresponding to that range is determined from the one-to-one correspondence. That label is used as the label of the target data, which completes the classification of the target data.
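A minimal sketch of the threshold-range lookup described above, assuming the target operation result lies in [0, 1]. The boundary values and label names are illustrative assumptions; the source does not specify them.

```python
from bisect import bisect_right

# Hypothetical range boundaries and labels (not taken from the source).
# They define three ranges: [0, 0.3), [0.3, 0.7), and [0.7, 1].
BOUNDS = [0.3, 0.7]
LABELS = ["low", "medium", "high"]


def label_for(result: float) -> str:
    """Return the class label whose threshold range contains the operation result."""
    if not 0.0 <= result <= 1.0:
        raise ValueError("result must lie in [0, 1]")
    # bisect_right counts how many boundaries are <= result, which is the range index.
    return LABELS[bisect_right(BOUNDS, result)]
```

Keeping the boundaries sorted gives the one-to-one mapping between ranges and labels for free: there is always exactly one more label than there are boundaries.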
As an optional solution, the execution unit includes:
1) a summation subunit, configured to perform a summation operation on the multiple target classification results to obtain a summation result;
2) a conversion subunit, configured to convert the summation result into a target probability value using the sigmoid function, and to determine the target probability value as the target operation result.
Optionally, in this embodiment, after the multiple target classification results are obtained from the branches of the target script file that correspond to the target data features, a summation operation is performed on them. The summation result is mapped into the interval from 0 to 1 using the sigmoid function to obtain the target probability value, and this target probability value is used as the target operation result for classifying the target data.
Optionally, in this embodiment, the interval from 0 to 1 is divided into the above multiple threshold ranges, each of which corresponds to one class label. After the target operation result is obtained, the target threshold range that contains it is determined, and the target class label corresponding to that range is determined as the label of the target data.
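The summation and sigmoid conversion can be illustrated with the short sketch below. The function names are hypothetical, and the two-branch form of the sigmoid is a common numerical-stability choice rather than something the source prescribes.

```python
import math
from typing import Iterable


def sigmoid(x: float) -> float:
    """Map a real number into (0, 1) without overflowing for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def target_probability(branch_results: Iterable[float]) -> float:
    """Sum the per-branch classification results, then convert the sum to a probability."""
    return sigmoid(sum(branch_results))
```

A summation result of zero maps to a probability of 0.5, positive sums map above 0.5, and negative sums map below it.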
In an optional implementation, as shown in Fig. 3, the detailed process of classifying target data based on the target script file includes the following steps:
Step S302: load the rule functions. The multiple functions conforming to the target format that are stored in the target script file are equivalent to rule functions. A purpose-designed main function calls the target script file and loads the rule functions in it.
Step S304: read the target data as a stream. When large-scale data is processed, the target data to be classified can be read in the form of streaming data.
Step S306: parse the target data into a fixed format. Target data features that meet the format requirements are extracted from the target data.
Step S308: load the parsed data and call the rule functions in a distributed manner. The rule functions corresponding to the target data features are called, so that multiple target classification results are obtained according to the classification conditions in the rule functions.
Step S310: obtain the result of each branch. Each rule function corresponding to the target data features is treated as one branch, and the multiple target classification results obtained are the branch results.
Step S312: sum the branch results, and map the summation result to the target probability value using the sigmoid function.
Step S314: judge whether the target data meets a preset screening condition, for example whether the target probability value is greater than a threshold.
Step S316: obtain the target data that meets the screening condition.
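Steps S302 to S316 can be sketched as a single generator, shown below as a rough illustration only. The rule functions `tree_0` and `tree_1`, the feature names, the `name=value` input format, and the 0.5 threshold are all assumptions made for the example; in practice the rule functions would come from the generated target script file.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, List

Features = Dict[str, float]


# S302: hypothetical rule functions, one per branch.
def tree_0(f: Features) -> float:
    return 0.4 if f["brightness"] < 0.5 else -0.3


def tree_1(f: Features) -> float:
    return 0.2 if f["contrast"] < 0.2 else -0.1


RULES: List[Callable[[Features], float]] = [tree_0, tree_1]


def parse(line: str) -> Features:
    """S306: parse 'name=value,name=value' (an assumed format) into a feature dict."""
    pairs = (item.split("=") for item in line.strip().split(","))
    return {name: float(value) for name, value in pairs}


def screen(lines: Iterable[str], threshold: float = 0.5) -> Iterator[Features]:
    """Yield the records whose target probability value exceeds the threshold."""
    for line in lines:                                   # S304: streaming read
        feats = parse(line)                              # S306: fixed format
        branches = [rule(feats) for rule in RULES]       # S308, S310: branch results
        prob = 1.0 / (1.0 + math.exp(-sum(branches)))    # S312: sum, then sigmoid
        if prob > threshold:                             # S314: screening condition
            yield feats                                  # S316: keep matching data
```

Because `screen` is a generator, records are processed one at a time as they arrive, which matches the streaming read in step S304.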
As an optional solution, the processing module includes:
1) a search unit, configured to search for classification results in the model file;
2) a second extraction unit, configured to extract from the model file the classification conditions corresponding to the found classification results and the data features that the classification conditions contain;
3) an establishing unit, configured to establish the correspondence among the data features, the classification conditions, and the classification results, obtaining the data features, classification conditions, and classification results that have a correspondence;
4) a conversion unit, configured to convert the data features, classification conditions, and classification results that have a correspondence into a script file in the target machine language, obtaining the target script file.
In an optional implementation, the parsing of the model file is illustrated by taking an xgboost model as the target model and python as the target machine language. As shown in Fig. 4, the target script file is generated as follows. The model file is read first, and each booster in the file is stored in its own list. The stored lists are then read in turn, and the content of each list is fetched item by item. For each item, it is judged whether a "leaf" has been encountered. If not, the judgment condition of the branch, the branch number, and similar content are saved. If so, the saved content is parsed into the rule function of that branch. This continues until all the content in the list has been fetched. It is then judged whether all lists have been read. If not, the next list is read and the above process is repeated. If so, the target script file is generated from the parsed rule functions.
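As a rough sketch of this kind of parsing, the code below converts a text dump in the layout produced by xgboost's text model dump (`booster[i]:` headers, split lines such as `0:[f0<0.5] yes=1,no=2,missing=1`, and leaf lines such as `1:leaf=0.4`) into Python source, one function per booster. It is not the procedure of Fig. 4: it builds a node table and emits nested `if`/`else` statements recursively instead of saving branch conditions until a leaf is reached, and it ignores the `missing` direction. All function names are hypothetical.

```python
import re
from typing import Dict, List

SPLIT = re.compile(r"(\d+):\[(\w+)<([^\]]+)\] yes=(\d+),no=(\d+)")
LEAF = re.compile(r"(\d+):leaf=([-+0-9.eE]+)")


def split_boosters(dump: str) -> List[List[str]]:
    """Store the lines of each booster in its own list."""
    trees: List[List[str]] = []
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("booster["):
            trees.append([])
        elif line and trees:
            trees[-1].append(line)
    return trees


def emit(nodes: Dict[int, tuple], nid: int, depth: int) -> List[str]:
    """Emit source lines for the subtree rooted at node nid."""
    pad = "    " * depth
    node = nodes[nid]
    if node[0] == "leaf":
        return [f"{pad}return {node[1]}"]
    _, feat, thr, yes, no = node
    return ([f"{pad}if x[{feat!r}] < {thr}:"] + emit(nodes, yes, depth + 1)
            + [f"{pad}else:"] + emit(nodes, no, depth + 1))


def tree_to_source(index: int, lines: List[str]) -> str:
    """Turn one booster's lines into the source of a rule function tree_<index>(x)."""
    nodes: Dict[int, tuple] = {}
    for line in lines:
        leaf = LEAF.match(line)
        if leaf:
            nodes[int(leaf.group(1))] = ("leaf", float(leaf.group(2)))
            continue
        split = SPLIT.match(line)
        if split:
            nid, feat, thr, yes, no = split.groups()
            nodes[int(nid)] = ("split", feat, float(thr), int(yes), int(no))
    return "\n".join([f"def tree_{index}(x):"] + emit(nodes, 0, 1))


def dump_to_script(dump: str) -> str:
    """Generate the text of a script file containing one rule function per booster."""
    trees = split_boosters(dump)
    return "\n\n".join(tree_to_source(i, t) for i, t in enumerate(trees))
```

The generated text contains only plain comparisons and `return` statements, so it can be written to a `.py` file and run on a machine that has no xgboost installation.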
For the application environment of this embodiment of the present invention, reference may be made, without limitation, to the application environment in the above embodiments, which is not described again here. An embodiment of the present invention provides an optional specific application example for implementing the above method for classifying target data.
As an optional implementation, the above method for updating a configuration object may be applied, without limitation, to the scenario shown in Fig. 6 in which a configuration object of client software is updated. This embodiment provides a method for parsing an xgboost model file based on a python script file. The method parses the xgboost model file into a general-purpose python script file that supports stream processing, and mainly includes two processing flows:
1) Model parsing: the xgboost model file obtained by backing up (dumping) the trained model is converted, using python, into general-purpose rule functions written in plain python.
2) Model calling: the above rule functions are called in the main function to obtain the target classification result of each target data item on each branch of the rule functions. After a summation operation, the sigmoid function is applied to obtain the target probability value of the sample, and the class of the target data is then obtained from the target threshold range into which the target probability value falls.
In an optional implementation, Fig. 7 shows part of the model file of a trained xgboost model. Classification results, that is, statements containing "leaf", are searched for in the model file. The classification conditions corresponding to the found classification results, and the data features that those conditions contain, are extracted from the model file. The correspondence among the data features, classification conditions, and classification results is established, and the data features, classification conditions, and classification results that have this correspondence are converted into a script file in the target machine language, yielding the target script file. Fig. 8 shows the target script file obtained by parsing the model file. The target script file is called by the code shown in Fig. 9 to obtain the values shown in Fig. 10, where the value to the left of "|" is the target probability value and the value to the right of "|" is the summation result. The result at the top of Fig. 10 is the probability value obtained by the xgboost model; the two are essentially identical.
It can be seen that classifying target data with the method provided in this embodiment, which parses an xgboost model file based on a python script file, enables real-time classification or prediction of large-scale data by the xgboost model. In some fields (where the data distribution changes little), the model can be migrated efficiently without depending on additional xgboost library functions, a higher python version, or python's own library functions. This reduces the complexity of the classification model when it handles large-scale data classification tasks.
According to another aspect of the embodiments of the present invention, an electronic device for implementing the above method for classifying target data is further provided. As shown in Fig. 11, the electronic device may include one or more processors 1102 (only one is shown in the figure), a memory 1104, a sensor 1106, an encoder 1108, and a transmission device 1110.
The memory 1104 may be used to store software programs and modules, such as the program instructions/modules corresponding to the video image playing method and apparatus in the embodiments of the present invention. The processor 1102 runs the software programs and modules stored in the memory 1104 to execute various functional applications and data processing, that is, to implement the image encoding method. The memory 1104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1104 may further include memory located remotely from the processor 1102, and such remote memory may be connected to the terminal through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 1110 is used to receive or send data via a network. Specific examples of the above network may include wired networks and wireless networks. In one example, the transmission device 1110 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and a router by a network cable so as to communicate with the Internet or a local area network. In one example, the transmission device 1110 is a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments, which are not described again here.
A person of ordinary skill in the art will understand that the structure shown in Fig. 11 is only illustrative. The electronic device may also be a terminal device such as a smartphone (for example an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID), or a PAD. Fig. 11 does not limit the structure of the above electronic device. For example, the electronic device may include more or fewer components than shown in Fig. 11 (such as a network interface or a display device), or have a configuration different from that shown in Fig. 11.
A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments can be completed by a program instructing hardware related to the terminal device. The program can be stored in a computer-readable storage medium, and the storage medium may include a flash disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and so on.
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be located on at least one of multiple network devices in a network.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps:
S1: obtain a model file, where the model file is a file used to store a target model, and the target model is a model for executing a classification task that has been trained using sample data;
S2: extract from the model file multiple data features, classification conditions, and classification results that have a correspondence, convert them into multiple functions conforming to a target format, and generate a target script file;
S3: classify target data based on the target script file.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in Embodiment 1 and Embodiment 2 above, which are not described again here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause one or more computer devices (which may be personal computers, servers, network devices, and so on) to execute all or some of the steps of the methods described in the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client can be implemented in other ways. The apparatus embodiments described above are only illustrative. For example, the division of the units is only a division by logical function, and other ways of dividing are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A method for classifying target data, characterized by comprising:
obtaining a model file, wherein the model file is a file used to store a target model, and the target model is a model for executing a classification task that has been trained using sample data;
extracting from the model file multiple data features, classification conditions, and classification results that have a correspondence, converting the multiple data features, classification conditions, and classification results that have a correspondence into multiple functions conforming to a target format, and generating a target script file;
classifying target data based on the target script file.
2. The method according to claim 1, characterized in that classifying the target data based on the target script file comprises:
extracting target data features from the target data;
calling the target script file, and inputting the target data features into the target script file to obtain multiple target classification results;
performing a target operation on the multiple target classification results to obtain a target operation result;
determining, among multiple threshold ranges, the target threshold range into which the target operation result falls;
determining the target class label corresponding to the target threshold range as the label of the target data, wherein the multiple threshold ranges correspond one-to-one with multiple class labels.
3. The method according to claim 2, characterized in that performing the target operation on the multiple target classification results to obtain the target operation result comprises:
performing a summation operation on the multiple target classification results to obtain a summation result;
converting the summation result into a target probability value using the sigmoid function, and determining the target probability value as the target operation result.
4. The method according to claim 1, characterized in that extracting from the model file multiple data features, classification conditions, and classification results that have a correspondence, converting the multiple data features, classification conditions, and classification results that have a correspondence into multiple functions conforming to a target format, and generating a target script file comprises:
searching for the classification results in the model file;
extracting from the model file the classification conditions corresponding to the found classification results and the data features that the classification conditions contain;
establishing the correspondence among the data features, the classification conditions, and the classification results, to obtain the data features, classification conditions, and classification results that have a correspondence;
converting the data features, classification conditions, and classification results that have a correspondence into a script file in a target machine language, to obtain the target script file.
5. The method according to any one of claims 1 to 4, characterized in that the model file comprises an xgboost model file, and the target script file comprises a python script file.
6. An apparatus for classifying target data, characterized by comprising:
an acquisition module, configured to obtain a model file, wherein the model file is a file used to store a target model, and the target model is a model for executing a classification task that has been trained using sample data;
a processing module, configured to extract from the model file multiple data features, classification conditions, and classification results that have a correspondence, convert the multiple data features, classification conditions, and classification results that have a correspondence into multiple functions conforming to a target format, and generate a target script file;
a classification module, configured to classify target data based on the target script file.
7. The apparatus according to claim 6, characterized in that the classification module comprises:
a first extraction unit, configured to extract target data features from the target data;
a processing unit, configured to call the target script file and input the target data features into the target script file to obtain multiple target classification results;
an execution unit, configured to perform a target operation on the multiple target classification results to obtain a target operation result;
a first determination unit, configured to determine, among multiple threshold ranges, the target threshold range into which the target operation result falls;
a second determination unit, configured to determine the target class label corresponding to the target threshold range as the label of the target data, wherein the multiple threshold ranges correspond one-to-one with multiple class labels.
8. The apparatus according to claim 7, characterized in that the execution unit comprises:
a summation subunit, configured to perform a summation operation on the multiple target classification results to obtain a summation result;
a conversion subunit, configured to convert the summation result into a target probability value using the sigmoid function, and to determine the target probability value as the target operation result.
9. The apparatus according to claim 6, characterized in that the processing module comprises:
a search unit, configured to search for the classification results in the model file;
a second extraction unit, configured to extract from the model file the classification conditions corresponding to the found classification results and the data features that the classification conditions contain;
an establishing unit, configured to establish the correspondence among the data features, the classification conditions, and the classification results, to obtain the data features, classification conditions, and classification results that have a correspondence;
a conversion unit, configured to convert the data features, classification conditions, and classification results that have a correspondence into a script file in a target machine language, to obtain the target script file.
10. The apparatus according to any one of claims 6 to 9, characterized in that the model file comprises an xgboost model file, and the target script file comprises a python script file.
11. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 5.
12. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor executes, by means of the computer program, the method according to any one of claims 1 to 5.
CN201711499178.5A 2017-12-29 2017-12-29 Target data classification method and device, storage medium and electronic device Active CN108334895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711499178.5A CN108334895B (en) 2017-12-29 2017-12-29 Target data classification method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711499178.5A CN108334895B (en) 2017-12-29 2017-12-29 Target data classification method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN108334895A true CN108334895A (en) 2018-07-27
CN108334895B CN108334895B (en) 2022-04-26

Family

ID=62924890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711499178.5A Active CN108334895B (en) 2017-12-29 2017-12-29 Target data classification method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN108334895B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103426007A (en) * 2013-08-29 2013-12-04 人民搜索网络股份公司 Machine learning classification method and device
CN104391956A (en) * 2014-11-27 2015-03-04 北京国双科技有限公司 Network update content detection method and device
CN104679831A (en) * 2015-02-04 2015-06-03 腾讯科技(深圳)有限公司 Method and device for matching human model
CN106156809A (en) * 2015-04-24 2016-11-23 阿里巴巴集团控股有限公司 For updating the method and device of disaggregated model
US20160342903A1 (en) * 2015-05-21 2016-11-24 Software Ag Usa, Inc. Systems and/or methods for dynamic anomaly detection in machine sensor data
CN105278991A (en) * 2015-10-26 2016-01-27 中国科学院软件研究所 Construction method of cloud application deployment and configuration model
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN106022483A (en) * 2016-05-11 2016-10-12 星环信息科技(上海)有限公司 Method and equipment for conversion between machine learning models
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN106502890A (en) * 2016-10-18 2017-03-15 乐视控股(北京)有限公司 Test case generation method and system
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblog prediction method based on extreme gradient boosting and random forest
CN107274888A (en) * 2017-06-14 2017-10-20 大连海事大学 Emotional speech recognition method based on octave signal intensity and differentiated feature subset
CN107423815A (en) * 2017-08-07 2017-12-01 北京工业大学 Computer-based low-quality classification image data cleaning method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213102A (en) * 2018-09-11 2019-01-15 深圳众城卓越科技有限公司 Multi-command monitoring method and device, computer equipment and storage medium
CN109213102B (en) * 2018-09-11 2022-01-18 深圳众城卓越科技有限公司 Multi-command monitoring method and device, computer equipment and storage medium
CN109325217A (en) * 2018-09-19 2019-02-12 深圳市元征科技股份有限公司 File conversion method, system, device and computer readable storage medium
CN109325217B (en) * 2018-09-19 2023-04-18 深圳市元征科技股份有限公司 File conversion method, system, device and computer readable storage medium
CN109240681A (en) * 2018-09-26 2019-01-18 郑州云海信息技术有限公司 Model generation method, device and computer readable storage medium
CN111199244A (en) * 2019-12-19 2020-05-26 北京航天测控技术有限公司 Data classification method and device, storage medium and electronic device
CN111199244B (en) * 2019-12-19 2024-04-09 北京航天测控技术有限公司 Data classification method and device, storage medium and electronic device
CN111090565A (en) * 2019-12-20 2020-05-01 上海有个机器人有限公司 Robot historical behavior playback method and system
CN111090565B (en) * 2019-12-20 2021-09-28 上海有个机器人有限公司 Robot historical behavior playback method and system
CN111858085A (en) * 2020-06-12 2020-10-30 贝壳技术有限公司 Model file exporting method and device
CN111858085B (en) * 2020-06-12 2024-06-07 贝壳技术有限公司 Method and device for exporting model file

Also Published As

Publication number Publication date
CN108334895B (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN108334895A (en) Sorting technique, device, storage medium and the electronic device of target data
CN108229478B (en) Image semantic segmentation and training method and device, electronic device, storage medium, and program
CN109325148A (en) Method and apparatus for generating information
CN110147711A (en) Video scene recognition method, device, storage medium and electronic device
CN106951925A (en) Data processing method, device, server and system
CN111931809A (en) Data processing method and device, storage medium and electronic equipment
CN115511501A (en) Data processing method, computer equipment and readable storage medium
CN111680147A (en) Data processing method, device, equipment and readable storage medium
CN111260220B (en) Group control equipment identification method and device, electronic equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN114419363A (en) Target classification model training method and device based on label-free sample data
CN112199411B (en) Big data analysis method and artificial intelligence platform applied to cloud computing communication architecture
CN114706966A (en) Voice interaction method, device and equipment based on artificial intelligence and storage medium
CN113342489A (en) Task processing method and device, electronic equipment and storage medium
CN112182175A (en) Intelligent question answering method, device, equipment and readable storage medium
CN113778864A (en) Test case generation method and device, electronic equipment and storage medium
CN110196805A (en) Data processing method, device, storage medium and electronic device
CN117149996A (en) Man-machine interface digital conversation mining method and AI system for artificial intelligence application
CN107451194A (en) Image searching method and device
CN115357720A (en) Multi-task news classification method and device based on BERT
CN112801053B (en) Video data processing method and device
CN115114462A (en) Model training method and device, multimedia recommendation method and device and storage medium
CN114298182A (en) Resource recall method, device, equipment and storage medium
CN113312445A (en) Data processing method, model construction method, classification method and computing equipment
CN112948251A (en) Automatic software testing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant