CN107025233B - Data feature processing method and device

Info

Publication number: CN107025233B
Application number: CN201610066847.9A
Authority: CN
Inventors: 张研; 杨冠军; 蒋程诚
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Shenzhen yunwangwandian e-commerce Co.,Ltd.
Priority date: 2016-01-29
Filing date: 2016-01-29
Publication date: 2020-04-28
Anticipated expiration: 2036-01-29
Also published as: CN107025233A

Abstract

The embodiment of the invention discloses a data feature processing method and device, relates to the technical field of big data processing, and can reduce the cost of data extraction and improve the accuracy of data extraction. The method of the invention comprises the following steps: obtaining a plaintext sample from a service log, wherein the plaintext sample at least comprises a special field and a characteristic field, and the special field comprises a field for representing an execution command and an operation command; according to a pre-configured feature class, obtaining a feature plaintext from the feature field, and recording a sample signature, wherein special fields with the same content correspond to the same sample signature; extracting a special field corresponding to the sample signature, and splicing the obtained characteristic plaintext to the special field to obtain a spliced field; and outputting the spliced field as a feature sample. The method is suitable for data feature extraction in big data processing.

Description

Data feature processing method and device

Technical Field

The present invention relates to the field of big data processing technologies, and in particular, to a method and an apparatus for processing data characteristics.

Background

With the development of internet technology, the data volume of online data increases exponentially, and in order to deal with the processing of massive data, many big data processing schemes are developed to extract required information from massive data.

For data in different fields and different types, due to the large difference in data dimensions, formats and the like, the data sources are also complicated, so that a lot of computing resources are occupied to screen and extract required information from massive data. In the existing scheme, effective data features are extracted through a certain programming language mainly in a text processing or data table mode, so that data extraction is realized.

However, the data characteristics of the data table are single, and it is difficult to accurately describe the profile of the data really required by the user, thereby affecting the effects of subsequent data analysis and modeling. Particularly, in a service data processing system with a high refresh frequency, such as an advertisement system, frequent updating and modeling of large-scale and multidimensional advertisement data are required, the cost is high, but the accuracy of data extraction is still low.

Disclosure of Invention

Embodiments of the present invention provide a data feature processing method and apparatus, which can reduce data extraction cost and improve data extraction accuracy.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

in a first aspect, an embodiment of the present invention provides a method for processing data characteristics, including:

obtaining a plaintext sample from a service log, wherein the plaintext sample at least comprises a special field and a characteristic field, and the special field comprises a field for representing an execution command and an operation command;

according to a pre-configured feature class, obtaining a feature plaintext from the feature field, and recording a sample signature, wherein special fields with the same content correspond to the same sample signature;

extracting a special field corresponding to the sample signature, and splicing the obtained characteristic plaintext to the special field to obtain a spliced field;

and outputting the spliced field as a feature sample.

With reference to the first aspect, in a first possible implementation manner of the first aspect, the obtaining a plaintext sample from a service log includes:

reading a plaintext field in the service log;

culling a first type field from the plaintext fields; and/or converting characters of a second type field in the plaintext field into a specified form;

and storing the fields subjected to the elimination and/or conversion processing into a memory in a Map mode through a MapReduce framework.

With reference to the first aspect, in a second possible implementation manner of the first aspect, the obtaining a feature plaintext from the feature field according to a pre-configured feature class includes:

sequentially reading fields in the feature class, wherein the content of the fields in the feature class is the same as that of at least one field in the plaintext sample;

according to the content of the fields in the feature class, sequentially reading the fields with the same content from the plaintext sample as the feature fields;

recording the feature fields sequentially read from the plaintext samples in a feature set.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the outputting the spliced field as a feature sample includes:

importing the feature sample and the feature set into a Reduce stage through a MapReduce framework;

the recording the characteristic fields sequentially read from the plaintext sample in a characteristic set includes: outputting the same feature fields read from the plaintext samples to the same compute node.

With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the method further includes:

reading a basic feature class and updating the basic feature class through a reflection mechanism;

and taking the basic feature class which is updated last time as the pre-configured feature class.

In a second aspect, an embodiment of the present invention provides a data feature processing apparatus, including:

an extraction unit, configured to obtain a plaintext sample from a service log, wherein the plaintext sample at least comprises a special field and a characteristic field, and the special field comprises a field for representing an execution command and an operation command;

the identification unit is used for acquiring a feature plaintext from the feature field according to a pre-configured feature class and recording a sample signature, wherein special fields with the same content correspond to the same sample signature;

the splicing unit is used for extracting a special field corresponding to the sample signature, and splicing the acquired feature plaintext to the special field to obtain a spliced field;

and the output unit is used for outputting the spliced field as a characteristic sample.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the apparatus further includes a preprocessing unit, configured to read a plaintext field in the service log; and eliminating a first type field from the plaintext field; and/or converting characters of a second type field in the plaintext field into a specified form; and storing the field subjected to the elimination and/or conversion processing into a memory in a Map mode through a MapReduce framework.

With reference to the second aspect, in a second possible implementation manner of the second aspect, the identifying unit is specifically configured to sequentially read fields in the feature class, where the fields in the feature class have the same content as at least one field in the plaintext sample; reading fields with the same content from the plaintext sample in sequence as the characteristic fields according to the content of the fields in the characteristic class; and recording the characteristic fields read from the plaintext samples in a characteristic set.

With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the output unit is specifically configured to import the feature sample and the feature set into a Reduce stage through a MapReduce framework; and outputting the same characteristic field read from the plaintext sample to the same compute node.

With reference to the second aspect, in a fourth possible implementation manner of the second aspect, the apparatus further includes a feature class management unit, configured to read a basic feature class and update the basic feature class through a reflection mechanism; and to use the most recently updated basic feature class as the pre-configured feature class.

According to the data feature processing method and device provided by the embodiment of the invention, the feature plaintext is obtained from the feature field of the plaintext sample according to the pre-configured feature class, the sample signature is recorded, the special field corresponding to the sample signature is extracted, the feature plaintext is spliced with the special field, and the spliced field is output as the feature sample for data extraction. Compared with the prior art, the required features are extracted from mass data according to pre-configured feature classes, which addresses both the difficulty of extracting large-scale, multidimensional data and the need to frequently update models, thereby reducing the cost of data extraction and improving its accuracy.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for processing data characteristics according to an embodiment of the present invention;

FIG. 3a, FIG. 3b and FIG. 3c are schematic structural diagrams of a data feature processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The present embodiment may adopt a MapReduce-based distributed processing framework (also referred to as a MapReduce framework); the specific architecture of the MapReduce framework used in the present embodiment may be as shown in FIG. 1. During execution, the data to be processed is stored in memory in Map form. If a Hadoop-based MapReduce framework is adopted, then for feature extraction, the feature fields and special fields of the data are extracted and output in the map stage, and identical feature fields are accumulated in the reduce stage; for samples, sample extraction is carried out in the map stage, and the feature samples recorded with sample signatures are output in the reduce stage.

An embodiment of the present invention provides a data feature processing method, as shown in fig. 2, including:

S1, acquiring a plaintext sample from the service log.

The plaintext sample comprises at least a special field and a feature field, and the special field comprises a field for representing an execution command and an operation command. The service log may be log data recorded while a service system runs, for example, log data recorded while an advertisement delivery system is running. The plaintext sample may be unencrypted characters in the service log; the obtained plaintext sample may specifically be tab-separated text and includes special fields indicating "display" and "click", such as "show" and "clk".
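The separation of a tab-separated plaintext sample into special fields and feature fields can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class name `PlaintextSample` and the assumption that the special fields occupy the first columns of each line are hypothetical, since the text does not fix a column layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for one plaintext sample; all names are illustrative only.
final class PlaintextSample {
    final List<String> specialFields;
    final List<String> featureFields;

    private PlaintextSample(List<String> specialFields, List<String> featureFields) {
        this.specialFields = Collections.unmodifiableList(specialFields);
        this.featureFields = Collections.unmodifiableList(featureFields);
    }

    // Assumption: the first `specialCount` tab-separated columns are the special fields
    // (for example "show" and "clk"); every remaining column is a feature field.
    static PlaintextSample parse(String line, int specialCount) {
        String[] columns = line.split("\t", -1);
        if (columns.length < specialCount) {
            throw new IllegalArgumentException("too few columns: " + line);
        }
        List<String> all = Arrays.asList(columns);
        return new PlaintextSample(
                new ArrayList<String>(all.subList(0, specialCount)),
                new ArrayList<String>(all.subList(specialCount, all.size())));
    }

    public static void main(String[] args) {
        PlaintextSample sample = parse("show\tclk\tA\tB\tC\tD", 2);
        System.out.println(sample.specialFields + " | " + sample.featureFields);
    }
}
```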

The processes S1-S4 may be specifically executed by a server in the map phase in the MapReduce framework.

S2, acquiring feature plaintext from the feature field according to the pre-configured feature class, and recording a sample signature.

In this embodiment, the server at the map stage reads a pre-configured feature class, where the feature class includes fields configured in sequence, and the content of a field in the feature class is the same as that of at least one field in the plaintext sample. The server at the map stage reads the input plaintext sample in key-value form according to the pre-configured feature class, and stores the plaintext sample in memory in Map form. The memory described in this embodiment may specifically be the memory of a local device of a user, or the memory of the server at the map stage.

The server in the map stage can strip the special fields indicating "display" and "click" from the plaintext sample, and sequentially extract the feature fields from the plaintext sample according to the field contents recorded in the pre-configured feature class. The sample signature corresponds to the plaintext sample; because the special fields indicating "display" and "click" are repeated many times in the plaintext sample, special fields with the same content in the same plaintext sample correspond to the same sample signature. The sample signature may be assigned by the server when the plaintext sample is stored in memory in Map form, or may be preconfigured in the plaintext sample.
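One way a server could assign signatures so that identical special-field content always maps to the same signature is sketched below. The text does not say how signatures are generated, so the `SignatureRegistry` class and its sequential integer signatures are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry; the text only requires that equal special-field
// content maps to the same signature, not any particular signature format.
final class SignatureRegistry {
    private final Map<String, Integer> signatures = new HashMap<String, Integer>();

    // Returns the signature already assigned to this content, or assigns the next one.
    int signatureFor(List<String> specialFields) {
        String key = String.join("\t", specialFields);
        Integer existing = signatures.get(key);
        if (existing != null) {
            return existing;
        }
        int next = signatures.size() + 1;
        signatures.put(key, next);
        return next;
    }

    public static void main(String[] args) {
        SignatureRegistry registry = new SignatureRegistry();
        System.out.println(registry.signatureFor(java.util.Arrays.asList("show", "clk")));
        System.out.println(registry.signatureFor(java.util.Arrays.asList("show", "clk")));
    }
}
```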

S3, extracting a special field corresponding to the sample signature, and splicing the obtained feature plaintext to the special field to obtain a spliced field.

For example, for the plaintext samples "show clk A …, show clk B …, show clk C …, show clk D",

the special field is "show clk" and the feature fields are "A B C D", so the following features can be obtained: A show clk, B show clk, C show clk and D show clk. Splicing then yields the spliced field "show clk feaA feaB feaC feaD".
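The splicing step in this example amounts to appending each feature plaintext to the shared special field. A minimal sketch follows; the class name `FeatureSplicer` and the single-space separator are assumptions chosen to match the example output.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper: appends feature plaintexts to the special field in order.
final class FeatureSplicer {
    static String splice(String specialField, List<String> featurePlaintexts) {
        StringBuilder spliced = new StringBuilder(specialField);
        for (String feature : featurePlaintexts) {
            spliced.append(' ').append(feature);
        }
        return spliced.toString();
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("feaA", "feaB", "feaC", "feaD");
        System.out.println(splice("show clk", features));
    }
}
```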

S4, outputting the spliced field as a feature sample.

Wherein, the server in the map phase can output the characteristic sample to the server in the reduce phase.

In this embodiment, for feature extraction, the feature plaintext is acquired from the feature field in the map stage according to a pre-configured feature class, and the pre-configured feature class can be obtained through the reflection mechanism in Java. For general requirements, the user therefore does not need to develop a data-table-based feature extraction program as in the prior art; for special requirements, the user only needs to use the feature extraction framework of this embodiment (i.e., the MapReduce framework running the execution flow of this embodiment) to extract the required features from mass data according to the pre-configured feature classes.

The reflection mechanism employed in this embodiment works as follows: which class needs to be loaded is not determined at compile time; instead, a specific class is loaded when the program runs, and the structural attributes of the class are then obtained. In this way, classes that are not known at compile time can be used. For example, when a class is loaded, the Java virtual machine automatically generates a Class object, through which information such as the methods, members and constructor declarations and definitions of the loaded class can be obtained. Specifically, for example, the process of obtaining the pre-configured feature class through the reflection mechanism in Java may include:

Using the Java reflection mechanism, a feature class factory class (Feature) is defined, as shown in the following code:

When features are extracted, the feature class name is configured under the individual service configuration; the configuration of multiple slots and multiple features is supported, and no early loading is required.

Then, when the user configuration file is called, it is parsed to obtain the feature class name according to the slot number, and the feature parsing class is instantiated by reflection for the feature extraction program to use. Feature extraction service classes of any type can be added according to specific service requirements by configuring their class names in the configuration file, so that user-written feature classes are used for different slots during feature extraction. Further, preprocessing classes are handled in the same way: a separate preprocessing factory class is defined, which likewise uses the Java reflection mechanism.
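The factory listing itself is not reproduced in this text. As a rough sketch of what a reflection-based factory of this kind might look like, the following code maps slot numbers to class names and instantiates the named class only when it is requested. The `FeatureExtractor` interface, the `PrefixFeature` example class and the `FeatureFactory` methods are all hypothetical names; only `Class.forName`, `asSubclass`, `getDeclaredConstructor` and `newInstance` are standard Java reflection calls.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract implemented by every configurable feature class.
interface FeatureExtractor {
    String extract(String fieldValue);
}

// Hypothetical user-written feature class, referenced only by name in the configuration.
final class PrefixFeature implements FeatureExtractor {
    public PrefixFeature() { }

    @Override
    public String extract(String fieldValue) {
        return "fea" + fieldValue;
    }
}

// Illustrative factory: resolves a slot number to a class name and instantiates it reflectively.
final class FeatureFactory {
    private final Map<Integer, String> classNameBySlot = new HashMap<Integer, String>();

    void configure(int slot, String className) {
        classNameBySlot.put(slot, className);
    }

    FeatureExtractor forSlot(int slot) {
        String className = classNameBySlot.get(slot);
        if (className == null) {
            throw new IllegalArgumentException("no feature class configured for slot " + slot);
        }
        return create(className);
    }

    // The class is loaded at run time; nothing is loaded before it is requested.
    static FeatureExtractor create(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(FeatureExtractor.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        FeatureFactory factory = new FeatureFactory();
        factory.configure(1, PrefixFeature.class.getName());
        System.out.println(factory.forSlot(1).extract("A"));
    }
}
```

A new feature class can then be introduced by adding its name to the configuration, without changing the factory or the extraction program.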

According to the data feature processing method provided by the embodiment of the invention, the feature plaintext is obtained from the feature field of the plaintext sample according to the pre-configured feature class, the sample signature is recorded, the special field corresponding to the sample signature is extracted, the feature plaintext is spliced with the special field, and the spliced field is output as the feature sample for data extraction. Compared with the prior art, the required features are extracted from mass data according to pre-configured feature classes, which addresses both the difficulty of extracting large-scale, multidimensional data and the need to frequently update models, thereby reducing the cost of data extraction and improving its accuracy.

In this embodiment, the server at the map stage may also preprocess the plaintext sample stored in memory in Map form, or preprocess the fields of the plaintext sample before it is stored in memory. For example, characters encoded with URL-ENCODE, base64 and the like can be subjected to preprocessing such as half-width/full-width conversion and upper/lower case conversion, and a user-defined preprocessing process may also be included. Thus, the obtaining of the plaintext sample from the service log includes:

reading a plaintext field in the service log;

culling a first type field from the plaintext fields; and/or converting characters of a second type field in the plaintext field into a specified form;

and storing the fields subjected to the culling and/or conversion processing into memory in Map form through the MapReduce framework.

The first type field refers to a field that contains data errors and cannot be read, or to characters indicating specific content (for example, characters indicating a modification date, separators, etc.). The second type field refers to a field whose characters can be converted, for example by half-width/full-width conversion or upper/lower case conversion; the converted character form is a specified form preset by the user or a form pre-stored in the server at the map stage.
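The culling and conversion steps can be sketched as below. The rule used to decide that a field is unreadable (null, empty, or containing the Unicode replacement character U+FFFD) and the choice of lowercase half-width text as the "specified form" are assumptions for illustration; the text leaves both to the user or server configuration. The conversion relies on the fact that full-width ASCII variants occupy U+FF01 to U+FF5E, exactly 0xFEE0 above their half-width counterparts.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative preprocessing: cull unreadable fields, normalize the rest, keep them in a Map.
final class FieldPreprocessor {
    // Converts full-width ASCII variants and the ideographic space to half-width characters.
    static String toHalfWidth(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                out.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - 0xFEE0));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Assumption: these conditions stand in for "data error, cannot be read".
    static boolean isUnreadable(String value) {
        return value == null || value.isEmpty() || value.indexOf('\uFFFD') >= 0;
    }

    static Map<String, String> preprocess(Map<String, String> rawFields) {
        Map<String, String> cleaned = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> entry : rawFields.entrySet()) {
            if (isUnreadable(entry.getValue())) {
                continue; // first type field: culled
            }
            // second type field: converted to the specified form
            cleaned.put(entry.getKey(), toHalfWidth(entry.getValue()).toLowerCase(Locale.ROOT));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<String, String>();
        raw.put("query", "\uFF21\uFF22\uFF23 Shoes");
        raw.put("broken", "");
        System.out.println(preprocess(raw));
    }
}
```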

In this embodiment, the obtaining the feature plaintext from the feature field according to the pre-configured feature class includes:

sequentially reading fields in the feature class; according to the content of the fields in the feature class, sequentially reading the fields with the same content from the plaintext sample as the feature fields; and recording the feature fields read from the plaintext sample in a feature set.

Wherein the content of the field in the feature class is the same as the content of at least one field in the plaintext sample. Specifically, the server at the map stage obtains a new plaintext sample set, initializes the pre-configured feature classes to be extracted, and calls the feature classes one by one for feature extraction according to the configured features to be extracted. For example:

the plaintext samples are: "show clk A B C D";

the pre-configured feature classes include:

Feaclass=featureclass1; dpd=A; slot=1,

Feaclass=featureclass2; dpd=B; slot=2,

Feaclass=featureclass3; dpd=C; slot=3,

Feaclass=featureclass4; dpd=D; slot=4,

The server can initialize featureclass1, featureclass2, featureclass3 and featureclass4, and then sequentially extract the features feaA, feaB, feaC and feaD according to the configured order. The server obtains the feature set {feaA, feaB, feaC, feaD} and the plaintext sample show clk A B C D, and completes the splicing according to the relationship between the special fields and the feature fields, where the relationship between the fields may include {feaA show clk …}. The splicing finally yields the feature sample: show clk feaA feaB feaC feaD.
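The configuration-driven, in-order extraction in this example can be sketched as follows. The parser assumes ASCII `=` and `;` separators with an optional trailing comma, and the names `FeatureConfig` and `SequentialExtractor` are hypothetical. The sketch also assumes that `dpd` names the sample field a feature depends on and that the extracted feature is written as "fea" plus that field name, mirroring the example rather than any stated rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical parsed form of one configuration line such as "Feaclass=featureclass1;dpd=A;slot=1".
final class FeatureConfig {
    final String feaClass;
    final String dpd;
    final int slot;

    FeatureConfig(String feaClass, String dpd, int slot) {
        this.feaClass = feaClass;
        this.dpd = dpd;
        this.slot = slot;
    }

    static FeatureConfig parse(String line) {
        String cleaned = line.trim().replaceAll(",$", "");
        Map<String, String> values = new HashMap<String, String>();
        for (String part : cleaned.split(";")) {
            String[] pair = part.split("=", 2);
            if (pair.length == 2) {
                values.put(pair[0].trim().toLowerCase(Locale.ROOT), pair[1].trim());
            }
        }
        if (!values.containsKey("feaclass") || !values.containsKey("dpd") || !values.containsKey("slot")) {
            throw new IllegalArgumentException("incomplete feature configuration: " + line);
        }
        return new FeatureConfig(values.get("feaclass"), values.get("dpd"),
                Integer.parseInt(values.get("slot")));
    }
}

final class SequentialExtractor {
    // Walks the configurations in their configured order and records a feature
    // for each one whose field content also appears in the plaintext sample.
    static List<String> extract(List<FeatureConfig> configs, List<String> sampleFields) {
        List<String> featureSet = new ArrayList<String>();
        for (FeatureConfig config : configs) {
            if (sampleFields.contains(config.dpd)) {
                featureSet.add("fea" + config.dpd);
            }
        }
        return featureSet;
    }

    public static void main(String[] args) {
        List<FeatureConfig> configs = Arrays.asList(
                FeatureConfig.parse("Feaclass=featureclass1;dpd=A;slot=1,"),
                FeatureConfig.parse("Feaclass=featureclass2;dpd=B;slot=2,"));
        System.out.println(extract(configs, Arrays.asList("show", "clk", "A", "B", "C", "D")));
    }
}
```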

In this embodiment, the outputting the spliced field as a feature sample includes:

importing the feature sample and the feature set into the Reduce phase through the MapReduce framework. The recording of the feature fields sequentially read from the plaintext sample in a feature set includes: outputting the same feature fields read from the plaintext samples to the same compute node.

For example, this embodiment may adopt the Hadoop MapReduce framework: the server in the map phase executes S1-S4 and then outputs the execution result (which includes the feature sample and the feature set) to the server in the reduce phase. Specifically, if the result is a feature sample, it is output directly to the reduce phase without processing; if the result is a feature set, identical features are distributed to the same compute node using the bucketing (partitioning) principle of the MapReduce framework. On receiving a feature sample, the server in the reduce phase outputs it directly; on receiving a feature set, it accumulates the show and clk values corresponding to the feature set and then outputs them.
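The reduce-side accumulation and the routing of identical features to the same node can be illustrated without a Hadoop dependency. In the sketch below, `ShowClickReducer` is a hypothetical class; its `partition` method follows the hash-and-modulo idea commonly used for default hash partitioning in Hadoop MapReduce, which is an assumption about how the bucketing here is realized.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the reduce stage: accumulates show and clk totals per feature.
final class ShowClickReducer {
    private final Map<String, long[]> totals = new HashMap<String, long[]>();

    void accept(String feature, long show, long clk) {
        long[] counts = totals.get(feature);
        if (counts == null) {
            counts = new long[2]; // counts[0] = show, counts[1] = clk
            totals.put(feature, counts);
        }
        counts[0] += show;
        counts[1] += clk;
    }

    // Returns a copy of {show, clk} for the feature, or {0, 0} if it was never seen.
    long[] totalsFor(String feature) {
        long[] counts = totals.get(feature);
        return counts == null ? new long[] {0L, 0L} : counts.clone();
    }

    // Identical features always hash to the same partition, so one node sees all their records.
    static int partition(String feature, int numReducers) {
        return (feature.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        ShowClickReducer reducer = new ShowClickReducer();
        reducer.accept("feaA", 1, 0);
        reducer.accept("feaA", 1, 1);
        long[] result = reducer.totalsFor("feaA");
        System.out.println("feaA show=" + result[0] + " clk=" + result[1]
                + " partition=" + partition("feaA", 4));
    }
}
```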

In this embodiment, the method further includes:

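reading a basic feature class and updating the basic feature class through a reflection mechanism; and taking the most recently updated basic feature class as the pre-configured feature class.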
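A minimal sketch of keeping the most recently updated class as the current pre-configured class is shown below. The `FeatureClassManager` name and its methods are hypothetical, and the example uses standard library classes in place of real feature classes so that it stays self-contained.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative manager: each update loads a class by name at run time, and the
// latest successful update becomes the class used for subsequent extraction.
final class FeatureClassManager {
    private final AtomicReference<Class<?>> current = new AtomicReference<Class<?>>();

    void update(String className) {
        try {
            current.set(Class.forName(className));
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown feature class " + className, e);
        }
    }

    String currentName() {
        Class<?> cls = current.get();
        return cls == null ? null : cls.getName();
    }

    // Instantiates the most recently updated class through its no-argument constructor.
    Object newInstance() {
        Class<?> cls = current.get();
        if (cls == null) {
            throw new IllegalStateException("no feature class configured");
        }
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + cls.getName(), e);
        }
    }

    public static void main(String[] args) {
        FeatureClassManager manager = new FeatureClassManager();
        manager.update("java.lang.StringBuilder");
        manager.update("java.util.ArrayList");
        System.out.println(manager.currentName());
    }
}
```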

An embodiment of the present invention further provides a data feature processing apparatus which, if applied to a MapReduce framework, may specifically operate in a server at the map stage. As shown in FIG. 3a, the apparatus includes:

The extraction unit is used for obtaining a plaintext sample from the service log, wherein the plaintext sample at least comprises a special field and a characteristic field, and the special field comprises a field used for representing an execution command and an operation command.

And the identification unit is used for acquiring a characteristic plaintext from the characteristic field according to a preset characteristic class and recording a sample signature, wherein the special fields with the same content correspond to the same sample signature.

And the splicing unit is used for extracting a special field corresponding to the sample signature, and splicing the acquired feature plaintext to the special field to obtain a spliced field.

And the output unit is used for outputting the spliced field as a feature sample.

In this embodiment, the identification unit is specifically configured to sequentially read fields in the feature class, where the content of a field in the feature class is the same as that of at least one field in the plaintext sample. And according to the content of the fields in the characteristic class, sequentially reading the fields with the same content from the plaintext sample as the characteristic fields. And recording the characteristic fields read from the plaintext samples in a characteristic set.

In this embodiment, the output unit is specifically configured to import the feature sample and the feature set into a Reduce phase through a MapReduce framework. And outputting the same characteristic field read from the plaintext sample to the same compute node.

Further, as shown in FIG. 3b, the apparatus further includes a preprocessing unit, configured to: read a plaintext field in the service log; cull a first type field from the plaintext fields, and/or convert characters of a second type field in the plaintext field into a specified form; and store the fields subjected to the culling and/or conversion processing into memory in Map form through a MapReduce framework.

Further, as shown in FIG. 3c, the apparatus further includes a feature class management unit, configured to read a basic feature class and update the basic feature class through a reflection mechanism, and to use the most recently updated basic feature class as the pre-configured feature class.

According to the data feature processing device provided by the embodiment of the invention, the feature plaintext is obtained from the feature field of the plaintext sample according to the pre-configured feature class, the sample signature is recorded, the special field corresponding to the sample signature is extracted, the feature plaintext is spliced with the special field, and the spliced field is output as the feature sample for data extraction. Compared with the prior art, the required features are extracted from mass data according to pre-configured feature classes, which addresses both the difficulty of extracting large-scale, multidimensional data and the need to frequently update models, thereby reducing the cost of data extraction and improving its accuracy.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for processing data features, comprising:

outputting the spliced field as a feature sample;

further comprising:

the plaintext sample is unencrypted characters in the service log, and the obtained plaintext sample is tab-separated text and comprises special fields for indicating "display" and "click";

reading a pre-configured feature class by a server at a map stage, wherein the feature class comprises fields configured in sequence, and the content of the fields in the feature class is the same as that of at least one field in the plaintext sample; the server at the map stage reads an input plaintext sample in key-value form according to the pre-configured feature class, and stores the plaintext sample in memory in Map form;

the server in the map stage strips the special fields for indicating "display" and "click" from the plaintext sample, and sequentially extracts feature fields from the plaintext sample according to the field contents recorded in the pre-configured feature class.

2. The method according to claim 1, wherein the obtaining the feature plaintext from the feature field according to the pre-configured feature class comprises:

3. The method of claim 2, wherein outputting the concatenated field as a feature sample comprises:

4. The method of claim 1, further comprising:

5. An apparatus for processing data features, comprising:

the output unit is used for outputting the spliced fields as characteristic samples;

further comprising:

6. The apparatus according to claim 5, wherein the identifying unit is specifically configured to sequentially read fields in the feature class, where the fields in the feature class have the same content as at least one field in the plaintext sample; reading fields with the same content from the plaintext sample in sequence as the characteristic fields according to the content of the fields in the characteristic class; and recording the characteristic fields read from the plaintext samples in a characteristic set.

7. The apparatus according to claim 6, wherein the output unit is specifically configured to import the feature sample and the feature set into a Reduce phase through a MapReduce framework; and to output the same feature field read from the plaintext sample to the same compute node.

8. The apparatus according to claim 5, further comprising a feature class management unit for reading a basic feature class and updating the basic feature class through a reflection mechanism; and using the basic feature class updated most recently as the pre-configured feature class.