CN115017145A - Data expansion method, device and storage medium - Google Patents
- Publication number
- CN115017145A (application CN202210625648.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- target
- expanded
- expansion
- service scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2379—Updates performed during online database operations; commit processing
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The embodiment of the invention discloses a data expansion method, a data expansion device and a storage medium. The method comprises the following steps: acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene; determining reference characteristic data of each data in the reference data set, and determining target characteristic data in the reference characteristic data; and selecting first data to be expanded with the target characteristic data from the data set to be expanded, and taking the first data to be expanded as target expanded data of the target service scene. According to the technical scheme in the embodiment of the invention, the data can be more quickly and effectively expanded, so that the time consumption of data expansion is reduced, and the efficiency of data expansion is further improved.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data expansion method, an apparatus, and a storage medium.
Background
In the related art, expanding data in a target service scenario generally requires calculating the data similarity between the data in the target service scenario and the data in other service scenarios, and then selecting the data to be expanded from the other service scenarios according to that similarity. This approach requires computation over the full amount of data in the scenes. When the data volume in a scene is large, the existing data expansion method is time-consuming and its data expansion efficiency is low.
Disclosure of Invention
The invention provides a data expansion method, a data expansion device and a storage medium, which are used for realizing more rapid and effective data expansion, thereby reducing the time consumption of data expansion and further improving the efficiency of data expansion.
According to an aspect of the present invention, there is provided a data expansion method, the method including:
acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene;
determining reference characteristic data of each data in the reference data set, and determining target characteristic data in the reference characteristic data;
and selecting first data to be expanded with the target characteristic data in the data set to be expanded, and taking the first data to be expanded as target expanded data of the target service scene.
Optionally, determining the target feature data in the reference feature data includes:
determining forward characteristic data associated with the target service scene in the reference characteristic data according to the importance coefficient of the reference characteristic data;
and determining target characteristic data in the forward characteristic data based on the interval duration between the current time and the target updating time of the forward characteristic data.
Optionally, the determining the target feature data in the forward feature data based on the interval duration between the current time and the target update time includes:
calculating an update metric value of the forward characteristic data according to an interval duration between the current time and the target update time, wherein the update metric value is used for measuring the update interval duration of the forward characteristic data;
and selecting an updating metric value meeting a preset metric condition from the updating metric values as a target metric value, and taking forward characteristic data corresponding to the target metric value as target characteristic data.
Optionally, the calculating, according to the interval duration between the current time and the target update time, an update metric value of the forward feature data includes:
calculating an updated metric value of the forward characteristic data according to the following formula:
where X_i' denotes the update metric value of the forward characteristic data, and X_i denotes the interval duration between the current time and the update time of the forward characteristic data.
Optionally, the method further comprises:
determining feature data to be expanded of each data in the data set to be expanded;
selecting second expansion data of the target service scene in the data set to be expanded according to each feature data to be expanded;
taking the first expansion data as the target expansion data of the target service scene includes:
taking all users in the first expansion data and the second expansion data as target expansion data of the target service scene; or,
taking the users common to the first expansion data and the second expansion data as target expansion data of the target service scene.
Optionally, the selecting, according to each feature data to be extended, second extension data of the target service scenario in the data set to be extended includes:
clustering the target characteristic data to obtain a clustering center;
calculating the data distance between each feature data to be expanded and the clustering center;
and selecting second expansion data of the target service scene in the data set to be expanded based on the data distances.
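As a non-limiting illustration, the clustering-based selection above can be sketched as follows. The sketch assumes numeric feature vectors, a small k-means routine with deterministic initialization, Euclidean distance, and a maximum-distance cutoff; the function names, the sample vectors, and the cutoff value are all hypothetical and are not specified by this disclosure.

```python
import math

def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def kmeans(points, k, iterations=10):
    """Tiny k-means; the first k points are used as initial centers (assumption)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def select_by_distance(candidates, centers, max_distance):
    """Keep candidates whose distance to the nearest center is within the cutoff."""
    return [
        cid
        for cid, vec in candidates.items()
        if min(math.dist(vec, c) for c in centers) <= max_distance
    ]

# Hypothetical target feature vectors and candidate feature vectors.
target_vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers = kmeans(target_vectors, k=2)
candidates = {"u1": [0.0, 1.5], "u2": [5.0, 5.0], "u3": [10.0, 10.5]}
second_data = select_by_distance(candidates, centers, max_distance=1.0)
```

Here "u1" and "u3" lie close to a clustering center and are kept, while "u2" is far from both centers and is dropped. Other clustering algorithms or distance measures could equally be used.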
Optionally, the selecting, according to each feature data to be extended, second extension data of the target service scenario in the data set to be extended includes:
respectively inputting the feature data to be expanded into data expansion models for performing data expansion according to different expansion dimensions to obtain data expansion values output by each data expansion model;
and carrying out weighted average processing on each data expansion value to obtain a target expansion value, and selecting second expansion data in the target service scene in the data set to be expanded according to the target expansion value.
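As a non-limiting illustration, the weighted-average step above can be sketched as follows. The sketch assumes that each data expansion model outputs one score per candidate, that one weight is assigned per expansion dimension, and that candidates are kept when the target expansion value reaches a cutoff; the weights, scores, and cutoff are hypothetical values chosen only for the example.

```python
def target_expansion_value(values, weights):
    """Weighted average of the per-model data expansion values."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

def select_second_data(model_outputs, weights, threshold):
    """Keep candidates whose target expansion value reaches the cutoff (assumption)."""
    return [
        cid
        for cid, values in model_outputs.items()
        if target_expansion_value(values, weights) >= threshold
    ]

# Hypothetical outputs of three data expansion models for two candidates.
model_outputs = {
    "user_1": [0.9, 0.8, 0.7],
    "user_2": [0.2, 0.4, 0.1],
}
weights = [0.5, 0.3, 0.2]
second_data = select_second_data(model_outputs, weights, threshold=0.6)
```

With these values, "user_1" has a target expansion value of about 0.83 and is selected, while "user_2" has a value of about 0.24 and is not.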
Optionally, the method further comprises:
acquiring sample data and expected output data corresponding to the sample data aiming at the initial network model of each extended dimension, wherein the sample data comprises positive sample data and negative sample data, the positive sample data comprises the reference characteristic data, and the negative sample data is characteristic data of other service scenes except the target service scene;
inputting the sample data into the initial network model to obtain actual output data of the initial network model;
and adjusting parameters of the initial network model according to the actual output data and the expected output data of the sample data to obtain a data expansion model.
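As a non-limiting illustration, the training procedure above (input sample data, compare actual output with expected output, adjust parameters) can be sketched with a simple logistic regression trained by stochastic gradient descent. Logistic regression is used here only as a stand-in for the initial network model; the feature vectors, learning rate, and number of epochs are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Actual output of the model for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(samples, expected, lr=0.5, epochs=500):
    """Adjust parameters from the gap between actual and expected output."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, expected):
            error = predict(weights, bias, x) - y  # actual minus expected
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical samples: positives stand for reference characteristic data of the
# target service scene, negatives for characteristic data of other scenes.
positive = [[1.0, 1.0], [0.9, 0.8]]
negative = [[0.0, 0.1], [0.1, 0.0]]
samples = positive + negative
expected = [1.0] * len(positive) + [0.0] * len(negative)

weights, bias = train(samples, expected)
```

After training, `predict(weights, bias, x)` returns a value above 0.5 for vectors resembling the positive samples and below 0.5 for vectors resembling the negative samples; one such model would be trained per expansion dimension.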
According to another aspect of the present invention, there is provided a data expansion apparatus. The device includes:
the data set acquisition module is used for acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene;
the characteristic data determining module is used for determining reference characteristic data of each data in the reference data set and determining target characteristic data in the reference characteristic data;
and the target expansion data acquisition module is used for selecting first to-be-expanded data with the target characteristic data in the to-be-expanded data set and using the first to-be-expanded data as target expansion data of the target service scene.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data expansion method of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the data expansion method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, the reference data set in the target service scene and the data set to be expanded of the associated service scene associated with the target service scene are obtained, so that the data expansion range can be determined more quickly. And then, reference characteristic data of each data in the reference data set can be determined, target characteristic data in the reference characteristic data can be determined, and conditions for expanding the data can be obtained more accurately. After the target characteristic data is determined, the first data to be expanded with the target characteristic data can be selected from the data set to be expanded, and the first data to be expanded is used as the target expansion data of the target service scene.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a data expansion method according to a first embodiment of the present invention;
Fig. 2 is a schematic flow chart of a data expansion method according to a second embodiment of the present invention;
Fig. 3 is a schematic flow chart of a data expansion method according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data expansion apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
It will be appreciated that the data involved in the subject technology, including but not limited to the data itself, the acquisition or use of the data, should comply with the requirements of the corresponding laws and regulations and related regulations.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart illustrating a data expansion method according to an embodiment of the present invention, where the embodiment is applicable to a case of expanding data, the method may be executed by a data expansion apparatus, the data expansion apparatus may be implemented in a form of hardware and/or software, and the data expansion apparatus may be configured in an electronic device such as a computer or a server.
As shown in fig. 1, the method of the present embodiment includes:
S110, acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene.
The target service scenario may be understood as a service scenario that needs data expansion currently. An associated service scenario may be understood as a service scenario associated with a target service scenario. The number of associated service scenarios may be one, two or more. A reference data set may be understood as a collection of all or part of the data in a target business scenario. Alternatively, the reference data set may be a set of users in the target service scenario. A data set to be expanded may be understood as a collection of all or part of the data in an associated business scenario.
Optionally, the associated service scenario associated with the target service scenario is determined by:
determining each service processing scene other than the target service scene; and, for each such service processing scene, determining whether it is an associated service scene according to its scene features and the scene features of the target service scene.
Specifically, determining whether a service processing scene other than the target service scene is an associated service scene includes: calculating the scene similarity between the scene features of that service processing scene and the scene features of the target service scene; if the scene similarity meets a preset scene similarity threshold, taking the service processing scene as an associated service scene; if the scene similarity does not meet the preset scene similarity threshold, determining that the service processing scene is not an associated service scene. The scene similarity threshold may be set according to actual requirements, for example 0.85, 0.9, or 0.95, and is not specifically limited herein.
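As a non-limiting illustration, the determination of associated service scenes can be sketched as follows. The sketch assumes that scene features are numeric vectors, that scene similarity is measured by cosine similarity, and that a similarity greater than or equal to the threshold meets the threshold; none of these choices is fixed by this embodiment, and the scene names and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_associated_scenes(target_features, other_scenes, threshold=0.9):
    """Return the scenes whose similarity to the target scene meets the threshold."""
    return [
        name
        for name, features in other_scenes.items()
        if cosine_similarity(target_features, features) >= threshold
    ]

# Hypothetical scene feature vectors.
target_scene = [1.0, 0.0, 1.0]
other_scenes = {
    "scene_a": [1.0, 0.1, 0.9],
    "scene_b": [0.0, 1.0, 0.0],
}
associated = find_associated_scenes(target_scene, other_scenes)
```

With a threshold of 0.9, "scene_a" is taken as an associated service scene and "scene_b" is not.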
S120, determining reference characteristic data of each data in the reference data set, and determining target characteristic data in the reference characteristic data.
The reference feature data may be feature data obtained by extracting features of each data in the reference data set (for example, feature data such as the age, region, sex, and business behavior of each user in the user set). The number of reference feature data may be one, two, or more than two. The target feature data may be one or more feature data in the reference feature data.
Specifically, after the reference data set is obtained, data features of each data in the reference data set are extracted. Thereby reference characteristic data of each data in the reference data set can be obtained. After obtaining the reference feature data, feature data meeting the preset feature data selection condition can be selected from the reference feature data based on the preset feature data selection condition, and the selected feature data is used as target feature data.
In an embodiment of the present invention, the reference characteristic data of each data in the reference data set is determined by:
A data tag is obtained for each data in the reference data set. Each data tag can then be parsed to obtain the tag content it includes, that is, the data features of the data corresponding to the tag, which are the reference feature data of each data in the reference data set. It will be appreciated that the data tags referencing each data in the data set may include
On the basis of the above embodiment, the data expansion method provided by the embodiment of the present invention further includes: adding a data tag to each data in the reference data set, so that the data island problem can be avoided. There are various ways to add data tags to the data in the reference data set.
As an optional implementation in the embodiment of the present invention, adding a data tag to each data in the reference data set includes: extracting the data features of each data in the reference data set, and adding corresponding data tags to the data in the reference data set according to those data features.
As another optional implementation in the embodiment of the present invention, adding a data tag to each data in the reference data set includes: inputting each data in the reference data set into a pre-trained data classification model to obtain a data tag for each data in the reference data set, and then adding the data tag to each data in the reference data set.
The data classification model may be a hybrid model that combines a word vector module (e.g., Word2Vec) with a text convolutional neural network module (e.g., TextCNN). In the embodiment of the invention, Word2Vec can improve the effect of data enhancement, while TextCNN can improve the fitting speed and achieve a better fitting effect. Compared with the prior art, the data classification model in the embodiment of the invention, which comprises the Word2Vec-TextCNN hybrid model, has better applicability and stronger extensibility and can improve the data classification effect.
In the embodiment of the present invention, the data classification model may be obtained by:
acquiring sample data and expected data corresponding to the sample data, wherein the sample data can be data in various service scenes, and the expected data can be expected data tags of the data in each service scene;
inputting sample data into a pre-constructed data classification model to obtain an actual data label output by the data classification model; and adjusting model parameters of the pre-constructed data classification model based on the expected data label and the actual data label to obtain the trained data classification model.
S130, selecting first data to be expanded with the target characteristic data from the data set to be expanded, and taking the first data to be expanded as target expanded data of the target service scene.
The first data to be expanded can be understood as data with target characteristic data in the data set to be expanded. The target expansion data may be understood as data that meets a data expansion condition in the data set to be expanded, i.e., the first data to be expanded. Wherein the data expansion condition is data with target characteristic data.
Specifically, after the target feature data is determined, the feature data of each data in the data set to be expanded is compared with the target feature data for consistency to obtain a comparison result. If the comparison result indicates consistency, the data in the data set to be expanded that corresponds to the consistent feature data is determined to be the first data to be expanded. After the first data to be expanded is determined, it may be used as target expansion data of the target service scenario.
It should be noted that, in the embodiment of the present invention, data processing is performed using the Spark distributed computing platform and a Python development framework to obtain a data table for storing data. After the data table is obtained, erroneous data can be checked and cleaned through rules, and missing data can be filled through data backfill, so that the data in the reference data set in the target service scene and the data in the data set to be expanded of the associated service scene associated with the target service scene are obtained.
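As a non-limiting illustration, the selection of the first data to be expanded can be sketched as follows. The sketch assumes that the feature data of each data is a set of tags and that a data is consistent with the target feature data when it contains every target feature; the user identifiers and tags are hypothetical.

```python
def select_first_data_to_expand(data_to_expand, target_features):
    """Return the IDs of data whose feature set contains every target feature."""
    target = set(target_features)
    return [
        data_id
        for data_id, features in data_to_expand.items()
        if target <= set(features)
    ]

# Hypothetical data set to be expanded: user ID -> feature tags.
data_to_expand = {
    "user_1": {"age:20-30", "region:east", "behavior:loan"},
    "user_2": {"age:20-30", "region:west"},
    "user_3": {"age:20-30", "region:east"},
}
target_features = {"age:20-30", "region:east"}
first_data = select_first_data_to_expand(data_to_expand, target_features)
```

Because only a set-containment check is performed per data, no pairwise similarity between the two data sets needs to be computed.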
According to the technical scheme of the embodiment of the invention, the reference data set in the target service scene and the data set to be expanded of the associated service scene associated with the target service scene are obtained, so that the data expansion range can be determined more quickly. And then, reference characteristic data of each data in the reference data set can be determined, target characteristic data in the reference characteristic data can be determined, and conditions for expanding the data can be obtained more accurately. After the target characteristic data is determined, the first data to be expanded with the target characteristic data can be selected from the data set to be expanded, and the first data to be expanded is used as the target expansion data of the target service scene.
Example two
Fig. 2 is a schematic flow chart of a data expansion method according to a second embodiment of the present invention. On the basis of the foregoing embodiment, optionally, determining target feature data in the reference feature data includes: determining forward characteristic data associated with the target service scene in the reference characteristic data according to the importance coefficient of the reference characteristic data; and determining the target feature data in the forward feature data based on an interval duration between the current time and the target update time of the forward feature data. Technical terms that are the same as or correspond to those in the above embodiment are not described again here.
As shown in fig. 2, the method of this embodiment specifically includes:
S210, acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene.
S220, determining reference characteristic data of each data in the reference data set.
S230, determining forward characteristic data associated with the target service scene in the reference characteristic data according to the importance coefficient of the reference characteristic data.
The importance coefficient may be used to characterize the importance of the reference characteristic data. Optionally, the importance coefficient may be a TGI (Target Group Index) value. The forward feature data may be understood as feature data that is highly associated with the target service scenario among the reference feature data. The number of forward characteristic data may be one, two, or more.
Specifically, the importance coefficient of each reference feature data is obtained. And then, based on a preset coefficient threshold, determining reference feature data corresponding to the importance degree coefficient exceeding the preset coefficient threshold, and using the reference feature data as forward feature data associated with the target service scene. The preset coefficient threshold may be set according to the actual data expansion demand, and is not specifically limited herein.
It is understood that if the numerical value of the importance degree coefficient of the reference feature data is larger, it indicates that the importance degree of the reference feature data is higher. Conversely, if the numerical value of the importance coefficient of the reference feature data is smaller, it indicates that the importance of the reference feature data is lower.
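As a non-limiting illustration, the selection of forward characteristic data by importance coefficient can be sketched as follows. The sketch assumes the commonly used definition of TGI, namely the share of the target group having a feature divided by the share of the overall population having that feature, multiplied by 100; this embodiment does not state how the coefficient is computed, and the counts and the coefficient threshold of 100 are hypothetical.

```python
def tgi(target_with_feature, target_total, overall_with_feature, overall_total):
    """Target Group Index under the common definition (assumption)."""
    target_share = target_with_feature / target_total
    overall_share = overall_with_feature / overall_total
    return target_share / overall_share * 100

def select_forward_features(feature_counts, target_total, overall_total, threshold=100.0):
    """Keep features whose importance coefficient exceeds the preset threshold."""
    forward = []
    for name, (in_target, in_overall) in feature_counts.items():
        if in_overall == 0:
            continue  # feature absent from the overall population; TGI undefined
        if tgi(in_target, target_total, in_overall, overall_total) > threshold:
            forward.append(name)
    return forward

# Hypothetical counts: feature -> (users in target scene, users overall).
feature_counts = {
    "region:east": (60, 300),  # 60% of target vs 30% overall
    "age:20-30": (20, 400),    # 20% of target vs 40% overall
}
forward = select_forward_features(feature_counts, target_total=100, overall_total=1000)
```

Here "region:east" is over-represented in the target service scene and is kept as forward characteristic data, while "age:20-30" is under-represented and is dropped.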
S240, determining target characteristic data in the forward characteristic data based on the interval duration between the current time and the target updating time of the forward characteristic data.
The target update time may be the time when the forward feature data is updated last time, or may be the time when the forward feature data is added for the first time.
Specifically, the time at which the forward characteristic data was most recently updated, or first added, may be taken as the target update time of the forward characteristic data. The interval duration between the target update time and the current time can then be calculated. If the interval duration does not exceed a preset interval duration threshold, the corresponding forward characteristic data can be used as target characteristic data. If the interval duration exceeds the preset interval duration threshold, the corresponding forward characteristic data may be determined to be historical forward characteristic data. This has the advantage that the target feature data in the forward feature data can be selected effectively, which further improves the timeliness of data expansion.
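As a non-limiting illustration, the interval-duration check can be sketched as follows. The feature names, the timestamps, and the 30-day threshold are hypothetical values chosen only for the example.

```python
from datetime import datetime, timedelta

def select_target_features(update_times, now, max_interval):
    """Keep forward features updated within the preset interval duration threshold."""
    return [
        name
        for name, updated_at in update_times.items()
        if now - updated_at <= max_interval
    ]

# Hypothetical target update times of two forward features.
now = datetime(2022, 6, 1)
update_times = {
    "region:east": datetime(2022, 5, 25),    # 7 days ago
    "behavior:loan": datetime(2021, 12, 1),  # about half a year ago
}
target_features = select_target_features(update_times, now, timedelta(days=30))
```

With a 30-day threshold, "region:east" is kept as target characteristic data and "behavior:loan" is treated as historical forward characteristic data.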
In this embodiment of the present invention, the determining the target feature data in the forward feature data based on the interval duration between the current time and the target update time includes: calculating to obtain an updating metric value of the forward characteristic data according to the interval duration between the current time and the target updating time; and selecting an updating metric value meeting a preset metric condition from the updating metric values as a target metric value, and taking forward characteristic data corresponding to the target metric value as target characteristic data. And the updating metric value is used for measuring the updating interval duration of the forward characteristic data. The update metric value may embody the novelty of the forward characteristic data. The preset measurement condition can be set according to actual requirements. The updated metric values corresponding to different preset metric conditions may be the same or different. The target metric value may be an updated metric value that meets a preset metric condition.
Optionally, the updated metric value of the forward characteristic data may be calculated according to the following formula:
wherein $X_i'$ represents the update metric value of the forward characteristic data, and $X_i$ indicates the interval duration between the current time and the update time of the forward characteristic data.
S250, selecting first data to be expanded with the target characteristic data from the data set to be expanded, and taking the first data to be expanded as target expanded data of the target service scene.
According to the technical scheme of the embodiment of the invention, forward characteristic data associated with the target service scene in the reference characteristic data is determined according to the importance coefficient of the reference characteristic data; and determining the target characteristic data in the forward characteristic data based on the interval duration between the current time and the target updating time of the forward characteristic data, so that the target characteristic data in the forward characteristic data can be more accurately determined, and the accuracy of data expansion is ensured.
Example three
Fig. 3 is a schematic flow chart of a data expansion method according to a third embodiment of the present invention. On the basis of the foregoing embodiments, optionally, the data expansion method provided by this embodiment further includes: determining feature data to be expanded of each data in the data set to be expanded; and selecting second expansion data of the target service scene from the data set to be expanded according to each feature data to be expanded. Taking the first extension data as the target extension data of the target service scene includes: taking all users in the first extended data and the second extended data as the target extended data of the target service scene; or taking the users common to the first extended data and the second extended data as the target extended data of the target service scene. Technical terms that are the same as or correspond to those in the above embodiments are not explained again here.
As shown in fig. 3, the method of this embodiment specifically includes:
S310, acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene.
S320, determining reference characteristic data of each data in the reference data set, and determining target characteristic data in the reference characteristic data.
S330, selecting first data to be expanded with the target characteristic data from the data set to be expanded.
S340, determining the feature data to be expanded of each data in the data set to be expanded.
The feature data to be expanded may be feature data obtained by performing data feature extraction on data in a data set to be expanded.
S350, selecting second expansion data of the target service scene from the data set to be expanded according to each feature data to be expanded.
The second expansion data may be data selected from the data set to be expanded based on feature data to be expanded of the data in the data set to be expanded.
In the embodiment of the present invention, there are various ways to select the second expansion data of the target service scene in the data set to be expanded according to each feature data to be expanded.
As an optional implementation manner in the embodiment of the present invention, the selecting, according to each feature data to be extended, second extension data of the target service scenario in the data set to be extended includes: clustering the target characteristic data to obtain a clustering center; calculating the data distance between each feature data to be expanded and the clustering center; and selecting second expansion data of the target service scene in the data set to be expanded based on each data distance. The data distance can be understood as the data similarity between each feature data to be expanded and the cluster center.
Selecting the second expansion data of the target service scene from the data set to be expanded based on the data distances includes: sorting the data distances in descending order, and determining the data whose feature data to be expanded has a data distance exceeding a preset data distance threshold as the second expansion data. The preset data distance threshold can be set according to actual requirements, so that data within a certain range of each cluster center is obtained as the second expansion data.
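As a non-limiting illustration, the clustering-based selection can be sketched in Python. The sketch makes several assumptions that the embodiment does not fix: plain k-means seeded with the first k points is used as the clustering algorithm, the "data distance" is realized as a hypothetical similarity 1 / (1 + Euclidean distance to the nearest cluster center) so that larger values mean closer data, and the record identifiers, feature vectors and threshold are invented for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers

def similarity(feature, centers):
    """Hypothetical data distance: 1 / (1 + distance to the nearest center)."""
    return 1.0 / (1.0 + min(math.dist(feature, c) for c in centers))

def select_second_expansion(candidates, centers, threshold):
    """candidates: {record_id: feature_vector}.

    Sorts the similarities in descending order and keeps the records
    whose similarity exceeds the threshold.
    """
    scored = sorted(((similarity(f, centers), rid)
                     for rid, f in candidates.items()), reverse=True)
    return [rid for score, rid in scored if score > threshold]

# Hypothetical target feature vectors and candidate records.
target_features = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
centers = kmeans(target_features, k=2)
candidates = {"u1": (1.1, 0.9), "u2": (3.0, 3.0), "u3": (5.1, 4.9)}
selected = select_second_expansion(candidates, centers, threshold=0.5)
```

Here the candidates lying next to a cluster center are selected, while the candidate midway between the two centers falls below the threshold and is excluded. A production implementation would normally use a more robust center initialization than the first k points.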
As another optional implementation manner in this embodiment of the present invention, the selecting, according to each feature data to be extended, second extension data of the target service scenario in the data set to be extended includes: respectively inputting the feature data to be expanded into data expansion models for data expansion aiming at different expansion dimensions to obtain data expansion values output by the data expansion models; and carrying out weighted average processing on each data expansion value to obtain a target expansion value, and selecting second expansion data in the target service scene in the data set to be expanded according to the target expansion value.
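As a non-limiting illustration, the weighted average over the data expansion values can be sketched in Python. The weights, the model outputs and the rule of keeping records whose target expansion value reaches a threshold are assumptions made for the example; the embodiment only states that the selection is made according to the target expansion value.

```python
def fuse_values(values, weights):
    """Weighted average of the data expansion values output by the models."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def select_by_target_value(model_outputs, weights, threshold):
    """model_outputs: {record_id: [value from the model of each expansion dimension]}.

    Keeps the records whose fused target expansion value reaches the
    threshold (the threshold rule is an assumption of this sketch).
    """
    return [rid for rid, values in model_outputs.items()
            if fuse_values(values, weights) >= threshold]

# Hypothetical outputs of three data expansion models and their weights.
model_outputs = {"u1": [0.9, 0.8, 0.7], "u2": [0.2, 0.4, 0.3]}
weights = [0.5, 0.3, 0.2]
selected = select_by_target_value(model_outputs, weights, threshold=0.6)
```

With these numbers the target expansion value of `u1` is 0.83 and that of `u2` is 0.28, so only `u1` is selected.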
In the embodiment of the invention, the data extension model is obtained by the following method:
acquiring sample data and expected output data corresponding to the sample data aiming at the initial network model of each extended dimension; inputting the sample data into the initial network model to obtain actual output data of the initial network model; and adjusting parameters of the initial network model according to the actual output data and the expected output data of the sample data to obtain a data expansion model.
The sample data comprises positive sample data and negative sample data; the positive sample data comprises the reference feature data, and the negative sample data is feature data of service scenes other than the target service scene. To avoid an imbalance between the positive sample data and the negative sample data, the embodiment of the invention can perform random downsampling on the positive sample data and the negative sample data, thereby achieving sample balance. On this basis, to avoid losing spatial information of the sample data, random downsampling can be performed multiple times during model training to obtain multiple sampling results, and the sampling results may then be averaged.
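As a non-limiting illustration, repeated random downsampling with averaging can be sketched in Python. The helper names, the fixed random seed and the stand-in `score_fn` (which in practice would train and evaluate a model on each balanced draw) are assumptions of this sketch and are not taken from the embodiment.

```python
import random

def balanced_downsample(positives, negatives, rng):
    """Randomly downsample both classes to the size of the smaller one."""
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

def averaged_result(positives, negatives, rounds, score_fn, seed=0):
    """Repeat the balanced draw several times and average the per-round results.

    score_fn(pos, neg) is a hypothetical callback returning one number per
    draw, for example the output of a model trained on that draw.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(rounds):
        pos, neg = balanced_downsample(positives, negatives, rng)
        results.append(score_fn(pos, neg))
    return sum(results) / len(results)

# Hypothetical samples: 3 positives and 10 negatives.
positives = [("p", i) for i in range(3)]
negatives = [("n", i) for i in range(10)]
pos, neg = balanced_downsample(positives, negatives, random.Random(0))
positive_share = averaged_result(
    positives, negatives, rounds=5,
    score_fn=lambda p, n: len(p) / (len(p) + len(n)))
```

Each draw contains the same number of positive and negative samples, and different draws use different subsets of the larger class, which is how repeated sampling limits the information lost by any single draw.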
It should be noted that, in the embodiment of the present invention, the initial network model of each extension dimension is trained to obtain data expansion models of different dimensions, so as to support data expansion in multiple service scenes. The models can be built with machine learning algorithms such as logistic regression, random forest, XGBoost, support vector machine and neural network, and fused by a weighted average method in which the weights are generated according to the accuracy of each algorithm, so as to obtain the final prediction result. By fusing several machine learning algorithms to implement data expansion, the embodiment of the invention effectively addresses algorithm stability and universality across different application scenes, helps consolidate the data expansion technique, and improves prediction accuracy.
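As a non-limiting illustration, generating fusion weights from model accuracy can be sketched in Python. Normalizing the accuracies so that the weights sum to one is an assumption of this sketch, since the embodiment does not specify how the weights are derived from accuracy; the accuracy figures and predictions are invented.

```python
def accuracy_weights(accuracies):
    """Turn per-model accuracies into weights that sum to 1 (assumed rule)."""
    total = sum(accuracies.values())
    return {name: acc / total for name, acc in accuracies.items()}

def fused_prediction(predictions, weights):
    """Weighted average of the per-model predictions for one record."""
    return sum(weights[name] * value for name, value in predictions.items())

# Hypothetical validation accuracies and predictions for one record.
accuracies = {"logistic_regression": 0.80, "random_forest": 0.85, "xgboost": 0.90}
weights = accuracy_weights(accuracies)
prediction = fused_prediction(
    {"logistic_regression": 0.6, "random_forest": 0.7, "xgboost": 0.9}, weights)
```

Under this rule a more accurate model contributes more to the final prediction, and the fused value always lies between the smallest and largest individual predictions.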
S360, obtaining target extended data based on the first extended data and the second extended data.
In the embodiment of the present invention, there are various ways to obtain the target extension data based on the first extension data and the second extension data.
As an optional implementation manner in the embodiment of the present invention, obtaining the target extended data based on the first extended data and the second extended data includes: taking all users in the first extended data and the second extended data as the target extended data of the target service scene. As another optional implementation manner in the embodiment of the present invention, obtaining the target extended data based on the first extended data and the second extended data includes: taking the users common to the first extended data and the second extended data as the target extended data of the target service scene.
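As a non-limiting illustration, the two ways of combining the first and second extended data can be sketched in Python as a set union and a set intersection. Reading the second option as the users present in both sets is an interpretation made for this sketch, and the function name, mode names and user identifiers are hypothetical.

```python
def combine_users(first, second, mode="all"):
    """Combine the users of the first and second extended data.

    mode="all":    every user found in either set (union).
    mode="common": only users found in both sets (intersection).
    """
    first, second = set(first), set(second)
    if mode == "all":
        return first | second
    if mode == "common":
        return first & second
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical user identifiers.
first_extended = ["u1", "u2", "u3"]
second_extended = ["u2", "u3", "u4"]
all_users = combine_users(first_extended, second_extended, mode="all")
common_users = combine_users(first_extended, second_extended, mode="common")
```

The union yields the larger audience, while the intersection yields a smaller audience that both selection routes agree on.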
According to the technical scheme of the embodiment of the invention, on the basis of the foregoing embodiments, the feature data to be expanded of each data in the data set to be expanded is determined, and second expansion data of the target service scene is selected from the data set to be expanded according to each feature data to be expanded. All users in the first extended data and the second extended data can be used as the target extended data of the target service scene; or the users common to the first extended data and the second extended data can be used as the target extended data of the target service scene, so that the data in the target service scene is effectively extended and the user requirements are further met.
Example four
Fig. 4 is a schematic structural diagram of a data expansion apparatus according to a fourth embodiment of the present invention. As shown in fig. 4, the apparatus includes: a data set acquisition module 410, a feature data determination module 420, and a target extension data acquisition module 430.
The data set acquisition module 410 is configured to acquire a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene;
the feature data determination module 420 is configured to determine reference feature data of each data in the reference data set, and determine target feature data in the reference feature data;
the target extension data acquisition module 430 is configured to select first to-be-expanded data having the target feature data from the data set to be expanded, and use the first to-be-expanded data as target extension data of the target service scene.
According to the technical scheme of the embodiment of the invention, the reference data set in the target service scene and the data set to be expanded of the associated service scene associated with the target service scene are obtained through the data set obtaining module, so that the data expansion range can be determined more quickly. Furthermore, the reference characteristic data of each data in the reference data set can be determined through the characteristic data determination module, and the target characteristic data in the reference characteristic data can be determined, so that the condition for expanding the data can be obtained more accurately. After the target characteristic data is determined, first to-be-expanded data with the target characteristic data can be selected from the to-be-expanded data set through a target expanded data acquisition module, and the first to-be-expanded data is used as the target expanded data of the target service scene.
Optionally, the feature data determination module 420 comprises a forward feature data determination unit and a target feature data determination unit; wherein,
the forward characteristic data determining unit is configured to determine, according to the importance coefficient of the reference characteristic data, forward characteristic data associated with the target service scene in the reference characteristic data;
the target characteristic data determining unit is used for determining target characteristic data in the forward characteristic data based on the interval duration between the current time and the target updating time of the forward characteristic data.
Optionally, the target feature data determining unit is configured to:
calculating an update metric value of the forward characteristic data according to an interval duration between the current time and the target update time, wherein the update metric value is used for measuring the update interval duration of the forward characteristic data;
and selecting an updating metric value meeting a preset metric condition from the updating metric values as a target metric value, and taking forward characteristic data corresponding to the target metric value as target characteristic data.
Optionally, the target feature data determining unit is configured to:
calculating an updated metric value of the forward characteristic data according to the following formula:
wherein $X_i'$ represents the update metric value of the forward characteristic data, and $X_i$ indicates the interval duration between the current time and the update time of the forward characteristic data.
Optionally, the apparatus further comprises: a second extended data selection module, configured to:
determining feature data to be expanded of each data in the data set to be expanded;
selecting second expansion data of the target service scene in the data set to be expanded according to each feature data to be expanded;
the target extension data acquisition module 430 is configured to:
take all users in the first extended data and the second extended data as target extended data of the target service scene; or,
take the users common to the first extended data and the second extended data as target extended data of the target service scene.
Optionally, the second extended data selecting module is specifically configured to:
clustering the target characteristic data to obtain a clustering center;
calculating the data distance between each feature data to be expanded and the clustering center;
and selecting second expansion data of the target service scene in the data set to be expanded based on the data distances.
Optionally, the second extended data selecting module is specifically configured to:
respectively inputting the feature data to be expanded into data expansion models for data expansion aiming at different expansion dimensions to obtain data expansion values output by the data expansion models;
and carrying out weighted average processing on each data expansion value to obtain a target expansion value, and selecting second expansion data in the target service scene in the data set to be expanded according to the target expansion value.
Optionally, the apparatus further comprises: a model training module to:
acquiring sample data and expected output data corresponding to the sample data aiming at the initial network model of each extended dimension, wherein the sample data comprises positive sample data and negative sample data, the positive sample data comprises the reference characteristic data, and the negative sample data is characteristic data of other service scenes except the target service scene;
inputting the sample data into the initial network model to obtain actual output data of the initial network model;
and adjusting parameters of the initial network model according to the actual output data and the expected output data of the sample data to obtain a data expansion model.
The data expansion device provided by the embodiment of the invention can execute the data expansion method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, the units and modules included in the data expansion apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
Example five
FIG. 5 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
In some embodiments, the data expansion method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the data expansion method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the data expansion method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method of data expansion, comprising:
acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene;
determining reference characteristic data of each data in the reference data set, and determining target characteristic data in the reference characteristic data;
and selecting first data to be expanded with the target characteristic data in the data set to be expanded, and taking the first data to be expanded as target expanded data of the target service scene.
2. The method of claim 1, wherein determining the target feature data in the reference feature data comprises:
determining forward characteristic data associated with the target service scene in the reference characteristic data according to the importance coefficient of the reference characteristic data;
and determining target characteristic data in the forward characteristic data based on the interval duration between the current time and the target updating time of the forward characteristic data.
3. The method according to claim 2, wherein the determining the target feature data in the forward feature data based on the interval duration between the current time and the target update time comprises:
calculating an update metric value of the forward characteristic data according to an interval duration between the current time and the target update time, wherein the update metric value is used for measuring the update interval duration of the forward characteristic data;
and selecting an updating metric value meeting a preset metric condition from the updating metric values as a target metric value, and taking forward characteristic data corresponding to the target metric value as target characteristic data.
4. The method according to claim 3, wherein calculating the update metric value of the forward characteristic data according to the interval duration between the current time and the target update time comprises:
calculating an updated metric value of the forward characteristic data according to the following formula:
wherein $X_i'$ represents the update metric value of the forward characteristic data, and $X_i$ indicates the interval duration between the current time and the update time of the forward characteristic data.
5. The method of claim 1, further comprising:
determining feature data to be expanded of each data in the data set to be expanded;
selecting second expansion data of the target service scene in the data set to be expanded according to each feature data to be expanded;
the taking the first extension data as the target extension data of the target service scenario includes:
taking all users in the first extended data and the second extended data as target extended data of the target service scene; or,
taking the users common to the first extended data and the second extended data as target extended data of the target service scene.
6. The method according to claim 5, wherein the selecting second extension data of the target service scenario in the data set to be extended according to each feature data to be extended includes:
clustering the target characteristic data to obtain a clustering center;
calculating the data distance between each feature data to be expanded and the clustering center;
and selecting second expansion data of the target service scene in the data set to be expanded based on the data distances.
7. The method according to claim 5, wherein the selecting second extension data of the target service scenario in the data set to be extended according to each feature data to be extended includes:
respectively inputting the feature data to be expanded into data expansion models for data expansion aiming at different expansion dimensions to obtain data expansion values output by the data expansion models;
and carrying out weighted average processing on each data expansion value to obtain a target expansion value, and selecting second expansion data in the target service scene in the data set to be expanded according to the target expansion value.
8. The method of claim 7, further comprising:
acquiring sample data and expected output data corresponding to the sample data aiming at the initial network model of each extended dimension, wherein the sample data comprises positive sample data and negative sample data, the positive sample data comprises the reference characteristic data, and the negative sample data is characteristic data of other service scenes except the target service scene;
inputting the sample data into the initial network model to obtain actual output data of the initial network model;
and adjusting parameters of the initial network model according to the actual output data and the expected output data of the sample data to obtain a data expansion model.
9. A data expansion apparatus, comprising:
the data set acquisition module is used for acquiring a reference data set in a target service scene and a data set to be expanded of an associated service scene associated with the target service scene;
the characteristic data determining module is used for determining reference characteristic data of each data in the reference data set and determining target characteristic data in the reference characteristic data;
and the target expansion data acquisition module is used for selecting first to-be-expanded data with the target characteristic data in the to-be-expanded data set and using the first to-be-expanded data as target expansion data of the target service scene.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the data expansion method of any one of claims 1 to 8 when executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210625648.2A CN115017145A (en) | 2022-06-02 | 2022-06-02 | Data expansion method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115017145A | 2022-09-06 |
Family
ID=83072167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210625648.2A Pending CN115017145A (en) | 2022-06-02 | 2022-06-02 | Data expansion method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115017145A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||