CN116860462B - Multithreading data acquisition method based on multi-bin slicing

Multithreading data acquisition method based on multi-bin slicing

Info

Publication number
CN116860462B
Authority
CN
China
Prior art keywords
preset
field name
category
bin
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311130962.4A
Other languages
Chinese (zh)
Other versions
CN116860462A (en)
Inventor
刘立宇
李强
初乃强
安西平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Singularity Of Life Beijing Technology Co ltd
Singularity Digital Beijing Technology Co ltd
Original Assignee
Singularity Of Life Beijing Technology Co ltd
Singularity Digital Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Singularity Of Life Beijing Technology Co ltd, Singularity Digital Beijing Technology Co ltd
Priority to CN202311130962.4A
Publication of CN116860462A
Application granted
Publication of CN116860462B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of electric digital data processing, and in particular to a multithreaded data acquisition method based on data warehouse slicing. The method comprises: if the number of records in $a_n$ is not greater than $q_0$, obtaining $b_n$; if $M > q_1$, traversing $b_n$ and, if $b_{n,m}$ is a field name belonging to a preset field name type, appending $b_{n,m}$ to a preset first field name sequence $C_1$; if $v \le q_1$, traversing $C_1$ and classifying the elements of $a_n$ corresponding to $c_{1,k}$ to obtain $u_k$; traversing $U$ and, if $u_k \le q_2$, appending the category sequence of the elements of $a_n$ corresponding to $c_{1,k}$ to a preset first category sequence $G$; traversing $G$ to obtain the standard deviation $\sigma_j$ of the per-category element counts of $G_{1,j}$ and appending $\sigma_j$ to a preset first standard deviation set $S'$; and performing multithreaded acquisition of the records in $a_n$ according to the categories of the elements corresponding to the field name that corresponds to $\min(S')$. The application improves the efficiency of data acquisition.

Description

Multithreading data acquisition method based on multi-bin slicing
Technical Field
The application relates to the technical field of electric digital data processing, and in particular to a multithreaded data acquisition method based on data warehouse slicing.
Background
In the prior art, data is mostly acquired with a single thread. When the volume of data to be transmitted is large, acquisition takes a long time and the user experience is poor. Multithreaded acquisition can shorten acquisition time and improve efficiency, but more threads is not always better: too many threads adversely affect both the data warehouse end and the acquisition end. How to shorten the overall time of multithreaded acquisition and improve its efficiency, while keeping the number of threads used within a preset thread count, is therefore a problem in urgent need of a solution.
Disclosure of Invention
The application aims to provide a multithreaded data acquisition method based on data warehouse slicing that effectively reduces data acquisition time and improves data acquisition efficiency while keeping the number of threads used within a preset thread count.
According to the application, a multithreaded data acquisition method based on data warehouse slicing is provided. The data warehouse comprises a target list set $A = \{a_1, a_2, \dots, a_n, \dots, a_N\}$, where $a_n$ is the $n$-th target list in the data warehouse, $n$ ranges from 1 to $N$, and $N$ is the number of target lists in the data warehouse. The method comprises the following steps:
S100, obtain the number of records $Q$ in $a_n$; if $Q \le q_0$, proceed to S200; $q_0$ is a preset record count threshold.
S200, obtain the field names $b_n$ of $a_n$, $b_n = (b_{n,1}, b_{n,2}, \dots, b_{n,m}, \dots, b_{n,M})$, where $b_{n,m}$ is the $m$-th field name in $a_n$, $m$ ranges from 1 to $M$, and $M$ is the number of field names in $a_n$.
S300, if $M > q_1$, traverse $b_n$; if $b_{n,m}$ is a field name belonging to a preset field name type, append $b_{n,m}$ to a preset first field name sequence $C_1$, obtaining $C_1 = (c_{1,1}, c_{1,2}, \dots, c_{1,k}, \dots, c_{1,v})$, where $c_{1,k}$ is the $k$-th field name appended to $C_1$, $k$ ranges from 1 to $v$, and $v$ is the number of field names appended to $C_1$; $C_1$ is initialized as empty; $q_1$ is a preset first field name count threshold.
S400, if $v \le q_1$, proceed to S500.
S500, traverse $C_1$ and classify the elements of $a_n$ corresponding to $c_{1,k}$ to obtain their number of categories $u_k$; the elements of $a_n$ corresponding to $c_{1,k}$ are the elements of $a_n$ located in the column of $c_{1,k}$, excluding $c_{1,k}$ itself.
S600, traverse $U = (u_1, u_2, \dots, u_k, \dots, u_v)$; if $u_k \le q_2$, append the category sequence of the elements of $a_n$ corresponding to $c_{1,k}$ to a preset first category sequence $G$, obtaining $G = (G_{1,1}, G_{1,2}, \dots, G_{1,j}, \dots, G_{1,w})$, where $G_{1,j}$ is the $j$-th category sequence appended to $G$, $j$ ranges from 1 to $w$, and $w$ is the number of category sequences appended to $G$; $G$ is initialized as empty; $q_2$ is a preset thread count threshold.
S700, traverse $G$ to obtain the standard deviation $\sigma_j$ of the per-category element counts of $G_{1,j}$, and append $\sigma_j$ to a preset first standard deviation set $S'$; $S'$ is initialized to the empty set.
S800, perform multithreaded acquisition of the records in $a_n$ according to the categories of the elements corresponding to the field name that corresponds to $\min(S')$; min() takes the minimum value.
Compared with the prior art, the application has at least the following beneficial effects:
For an $a_n$ to be transmitted, the application obtains the number of records in $a_n$. If that number is small, the number of field names in $a_n$ is checked. If the number of field names is large, the field names of $a_n$ that belong to a preset field name type are selected. If the number of such field names is small, the elements corresponding to each of them are classified. Taking the preset thread count threshold into account, the standard deviation of the per-category element counts is then obtained for each field name whose number of categories does not exceed that threshold, and the element categories of the field name with the smallest standard deviation serve as the basis for multithreaded acquisition of the records in $a_n$, with one thread acquiring the records of one category. As a result, with the number of threads not exceeding the preset thread count threshold, the record counts handled by the threads differ little, so their acquisition times differ little; the long overall acquisition time caused by a badly unbalanced distribution of records across threads is avoided, and data acquisition efficiency improves.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a multi-threaded data acquisition method based on a multi-bin slice according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
According to the application, a multithreaded data acquisition method based on data warehouse slicing is provided. The data warehouse comprises a target list set $A = \{a_1, a_2, \dots, a_n, \dots, a_N\}$, where $a_n$ is the $n$-th target list in the data warehouse, $n$ ranges from 1 to $N$, and $N$ is the number of target lists in the data warehouse. The method comprises the following steps, as shown in Fig. 1:
S100, obtain the number of records $Q$ in $a_n$; if $Q \le q_0$, proceed to S200; $q_0$ is a preset record count threshold.
In this embodiment every target list in the data warehouse is acquired in the same way; the acquisition of $a_n$ is taken as the example.
In this embodiment $q_0$ is an empirical value; optionally, $q_0$ is on the order of millions or tens of millions.
S200, obtain the field names $b_n$ of $a_n$, $b_n = (b_{n,1}, b_{n,2}, \dots, b_{n,m}, \dots, b_{n,M})$, where $b_{n,m}$ is the $m$-th field name in $a_n$, $m$ ranges from 1 to $M$, and $M$ is the number of field names in $a_n$.
S300, if $M > q_1$, traverse $b_n$; if $b_{n,m}$ is a field name belonging to a preset field name type, append $b_{n,m}$ to a preset first field name sequence $C_1$, obtaining $C_1 = (c_{1,1}, c_{1,2}, \dots, c_{1,k}, \dots, c_{1,v})$, where $c_{1,k}$ is the $k$-th field name appended to $C_1$, $k$ ranges from 1 to $v$, and $v$ is the number of field names appended to $C_1$; $C_1$ is initialized as empty; $q_1$ is a preset first field name count threshold.
In this embodiment $q_1$ is an empirical value. When $M > q_1$, $a_n$ is judged to contain a large number of field names; to keep the subsequent classification of the elements corresponding to the field names from taking too long, this embodiment obtains, when $M > q_1$, only the field names in $b_n$ that belong to a preset field name type.
Specifically, S300 includes:
S310, perform word segmentation on $b_{n,m}$ to obtain its word set $FC_{n,m} = \{fc^{1}_{n,m}, fc^{2}_{n,m}, \dots, fc^{zj}_{n,m}, \dots, fc^{cl}_{n,m}\}$, where $fc^{zj}_{n,m}$ is the $zj$-th word obtained by segmenting $b_{n,m}$, $zj$ ranges from 1 to $cl$, and $cl$ is the number of words obtained by segmenting $b_{n,m}$.
Those skilled in the art will appreciate that any word segmentation method in the prior art falls within the scope of the present application.
S320, obtain a preset vocabulary set $CB = \{cb_1, cb_2, \dots, cb_{qb}, \dots, cb_{QB}\}$, where $cb_{qb}$ is the $qb$-th preset word in $CB$, $qb$ ranges from 1 to $QB$, and $QB$ is the number of preset words in $CB$; each $cb_{qb}$ is a word input by the user, a word characterizing time, or a word characterizing category.
In this embodiment, $CB$ includes user-input words, words characterizing time, and words characterizing category. The user-input words are determined in advance by the user according to the actual application scenario; the words characterizing time include date, time, and the like; the words characterizing category include category, group, type, and the like.
S330, traverse $FC_{n,m}$ and $CB$ to obtain the semantic similarity $xsd^{qb}_{zj}$ between $fc^{zj}_{n,m}$ and $cb_{qb}$.
Those skilled in the art will appreciate that any method of obtaining semantic similarity between two words in the prior art falls within the scope of the present application.
S340, if $xsd^{qb}_{zj}$ is greater than or equal to a preset similarity threshold, judge $b_{n,m}$ to be a field name belonging to a preset field name type; otherwise, judge $b_{n,m}$ to be a field name not belonging to a preset field name type.
In this embodiment, as long as the semantic similarity between some $fc^{zj}_{n,m}$ and some $cb_{qb}$ in $CB$ is greater than or equal to the preset similarity threshold, $b_{n,m}$ is judged to be a field name belonging to a preset field name type; only when the semantic similarity between every $fc^{zj}_{n,m}$ and every $cb_{qb}$ in $CB$ is below the preset similarity threshold is $b_{n,m}$ judged not to belong to a preset field name type. The preset similarity threshold is an empirical value; optionally, it lies in the range $[0.8, 0.9]$.
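As a rough illustration of S310-S340, the following Python sketch filters field names against a preset vocabulary. The application leaves both the segmentation method and the semantic similarity measure open, so everything concrete here is an assumption: the `segment` helper, the sample field names, and in particular the character-level ratio from the standard library's `difflib`, which only stands in for a real semantic similarity.

```python
import re
from difflib import SequenceMatcher


def segment(field_name):
    """Split a field name into lowercase words (snake_case, kebab-case, camelCase)."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", field_name)
    return [w.lower() for w in re.split(r"[_\-\s]+", spaced) if w]


def similarity(word_a, word_b):
    # Stand-in only: character-level similarity, not a semantic measure.
    return SequenceMatcher(None, word_a, word_b).ratio()


def belongs_to_preset_type(field_name, vocabulary, threshold=0.8):
    """S340: true as soon as any (word, preset word) pair reaches the threshold."""
    return any(
        similarity(word, preset) >= threshold
        for word in segment(field_name)
        for preset in vocabulary
    )


# Hypothetical vocabulary CB and field names b_n, for illustration only.
CB = ["date", "time", "category", "group", "type"]
fields = ["create_date", "userId", "order_type"]
C1 = [name for name in fields if belongs_to_preset_type(name, CB)]
```

Because the check succeeds on the first matching pair, a field name with many words costs at most one similarity computation per (word, preset word) pair.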
S400, if $v \le q_1$, proceed to S500.
S500, traverse $C_1$ and classify the elements of $a_n$ corresponding to $c_{1,k}$ to obtain their number of categories $u_k$; the elements of $a_n$ corresponding to $c_{1,k}$ are the elements of $a_n$ located in the column of $c_{1,k}$, excluding $c_{1,k}$ itself.
That is, in this embodiment the elements of $a_n$ corresponding to $c_{1,k}$ are the values in the column of $a_n$ headed by $c_{1,k}$, not the field name $c_{1,k}$ itself.
Those skilled in the art will appreciate that any prior-art method of classifying elements falls within the scope of the present application. For example, a SQL GROUP BY clause can be used to classify the elements.
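The classification in S500 can be pictured with a small in-memory analogue of a GROUP BY with a count. This is only a sketch: representing records as Python dictionaries and the field name `region` are assumptions made for the example, and a real deployment would more likely run the grouping inside the data warehouse.

```python
from collections import Counter


def classify_column(records, field_name):
    """Return {category: element count} for one column, like GROUP BY with COUNT(*)."""
    return Counter(rec[field_name] for rec in records)


# Hypothetical records of a_n with a single candidate field name.
rows = [
    {"region": "north"}, {"region": "south"}, {"region": "north"},
    {"region": "south"}, {"region": "north"}, {"region": "south"},
]
counts = classify_column(rows, "region")
category_sequence = list(counts)  # the categories of the column
u_k = len(counts)                 # number of categories for this field name
```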
S600, walk u= (U) 1 ,u 2 ,…,u k ,…,u v ) If u k ≤q 2 Will be a n In c) 1,k The class sequence of the corresponding element is added to a preset first class sequence G to obtain G= (G) 1,1 ,G 1,2 ,…,G 1,j ,…,G 1,w ),G 1,j For the j-th class sequence added to G, j has a value ranging from 1 to w, w being the number of class sequences added to G; initializing G to be a null value; q 2 Is a preset thread number threshold.
Specifically, G 1,j =(g 1 1,j ,g 2 1,j ,…,g i 1,j ,…,g x 1,j ),g i 1,j Is G 1,j Includes the ith category, i has a value ranging from 1 to x, x is G 1,j ComprisingNumber of categories.
Alternatively, q 2 Is an empirical value; preferably, q 2 The acquisition process of (1) comprises:
s610, obtaining the maximum thread connection number ang of the number bins max
S620, obtaining the current thread connection number ang of the number bins 0
S630, acquiring the maximum acquisition task number ang allowed by the acquisition end memory 1 ,ang 1 =floor(nc 0 /nc 1 );nc 0 Nc is the current free memory size of the acquisition end 1 And (3) setting floor () as a downward rounding for the memory size occupied by a preset acquisition task.
Nc in the present embodiment 1 Is an empirical value.
S640, obtaining the number ang of thread connections allowed by the CPU of the acquisition end 2 The method comprises the steps of carrying out a first treatment on the surface of the When the utilization rate of the CPU at the acquisition end is in different preset utilization rate ranges, ang 2 And presetting a value for the thread connection quantity corresponding to the preset utilization rate range.
As a specific implementation manner, when the utilization rate of the CPU at the acquisition end is 10% -20%, the corresponding preset value of the thread connection quantity is 10; when the utilization rate of the CPU at the acquisition end is 20% -40%, the corresponding thread connection quantity preset value is 5; when the utilization rate of the CPU at the acquisition end is 40% -50%, the corresponding thread connection quantity preset value is 3.
S650, obtain $q_2 = \min((ang_{max} - ang_0), ang_1, ang_2)$.
The $q_2$ obtained through S610-S650 takes the performance of both the data warehouse end and the acquisition end into account, avoiding the adverse effects that an excessive number of threads would have on either end.
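The computation of $q_2$ in S610-S650 reduces to a few lines. The sketch below uses the utilization ranges and preset values from the specific implementation above; treating each range as half-open, raising an error outside the listed ranges, and the sample input numbers are assumptions of this sketch, since the application does not say how range boundaries or other utilization values are handled.

```python
import math

# (low %, high %, preset thread connections) from the embodiment's example;
# treating each range as half-open [low, high) is an assumption of this sketch.
CPU_PRESETS = [(10, 20, 10), (20, 40, 5), (40, 50, 3)]


def cpu_allowed_connections(cpu_percent):
    """S640: look up ang_2 for the current CPU utilization of the acquisition end."""
    for low, high, preset in CPU_PRESETS:
        if low <= cpu_percent < high:
            return preset
    raise ValueError("no preset given for this CPU utilization")


def thread_threshold(ang_max, ang_0, nc_0, nc_1, cpu_percent):
    """S650: q_2 = min(ang_max - ang_0, ang_1, ang_2)."""
    ang_1 = math.floor(nc_0 / nc_1)  # S630: tasks that fit in free memory
    ang_2 = cpu_allowed_connections(cpu_percent)
    return min(ang_max - ang_0, ang_1, ang_2)


# Hypothetical readings: 8 free connections, memory for 8 tasks, CPU at 25 %.
q2 = thread_threshold(ang_max=100, ang_0=92, nc_0=4096, nc_1=512, cpu_percent=25)
```

With these readings the CPU preset is the binding constraint; with only two free warehouse connections the warehouse would be.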
S700, traverse $G$ to obtain the standard deviation $\sigma_j$ of the per-category element counts of $G_{1,j}$, and append $\sigma_j$ to a preset first standard deviation set $S'$; $S'$ is initialized to the empty set.
Specifically, $\sigma_j = \left(\left(\sum_{i=1}^{x} (num_{i,j} - num_j)^2\right)/x\right)^{0.5}$, where $num_{i,j}$ is the number of elements of category $g^{i}_{1,j}$ among the elements of $a_n$ corresponding to $c_{1,k}$, and $num_j$ is the average number of elements per category among those elements, $num_j = \left(\sum_{i=1}^{x} num_{i,j}\right)/x$.
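To make S600-S700 and the choice of $\min(S')$ concrete, the sketch below filters candidate field names by the thread count threshold, computes the population standard deviation (division by $x$, as in the formula above, which is what `statistics.pstdev` does), and picks the field name with the smallest value. The field names and counts are invented for illustration.

```python
from statistics import pstdev


def select_field(counts_by_field, q2):
    """Return (field with the smallest sigma, {field: sigma}) among fields with u_k <= q2."""
    sigmas = {}
    for field, counts in counts_by_field.items():
        if len(counts) <= q2:  # S600: keep only fields with at most q2 categories
            sigmas[field] = pstdev(list(counts.values()))  # S700: sigma_j
    if not sigmas:
        return None, sigmas
    return min(sigmas, key=sigmas.get), sigmas


# Hypothetical per-category element counts for three candidate field names.
counts_by_field = {
    "region": {"north": 3, "south": 3},   # perfectly balanced
    "status": {"ok": 5, "err": 1},        # unbalanced
    "id": {i: 1 for i in range(6)},       # 6 categories, more than q2
}
best, sigmas = select_field(counts_by_field, q2=4)
```

Here `status` has mean 3 and deviations of 2, so its standard deviation is 2.0, while `region` has 0.0 and is selected; `id` is dropped because its six categories would need more threads than the threshold allows.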
S800, perform multithreaded acquisition of the records in $a_n$ according to the categories of the elements corresponding to the field name that corresponds to $\min(S')$; min() takes the minimum value.
Specifically, this multithreaded acquisition comprises: using $shu$ threads to acquire the records in $a_n$, each thread acquiring the records of one category of the elements corresponding to the field name that corresponds to $\min(S')$, where $shu$ is the number of categories of those elements.
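One way to picture S800 is a thread pool with one worker per category. In the sketch below the records are grouped in memory only so that the example is self-contained; in practice each worker would presumably issue its own query filtered on its category. The `fetch` callable, the field name, and the data are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def acquire_by_category(records, field, fetch):
    """Run one worker thread per category of `field` and collect the results."""
    # Group records by category; stands in for per-category queries on the warehouse.
    slices = defaultdict(list)
    for rec in records:
        slices[rec[field]].append(rec)

    shu = len(slices)  # number of categories == number of threads
    with ThreadPoolExecutor(max_workers=max(1, shu)) as pool:
        futures = {cat: pool.submit(fetch, cat, rows) for cat, rows in slices.items()}
        return {cat: fut.result() for cat, fut in futures.items()}


def fetch(category, rows):
    # Hypothetical acquisition task: here it only reports how many records it handled.
    return len(rows)


rows = [{"region": r} for r in ["north", "south", "north", "south", "north", "south"]]
result = acquire_by_category(rows, "region", fetch)
```

Because the selected field name has the smallest standard deviation, the slices handed to the workers are close in size, which is what keeps the slowest thread from dominating the total time.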
For an $a_n$ to be transmitted, the application obtains the number of records in $a_n$. If that number is small, the number of field names in $a_n$ is checked. If the number of field names is large, the field names of $a_n$ that belong to a preset field name type are selected and the elements corresponding to them are classified. Taking the preset thread count threshold into account, the standard deviation of the per-category element counts is then obtained for each field name whose number of categories does not exceed that threshold, and the element categories of the field name with the smallest standard deviation serve as the basis for multithreaded acquisition of the records in $a_n$, with one thread acquiring the records of one category. As a result, with the number of threads satisfying the preset thread count threshold, the record counts handled by the threads differ little, so their acquisition times differ little; the long overall acquisition time caused by a badly unbalanced distribution of records across threads does not arise, and data acquisition efficiency improves.
In this embodiment, S100 further includes: if $Q > q_0$, proceed to S2000.
S2000, randomly sample $a_n$ to obtain $a'_n = (r_1, r_2, \dots, r_d, \dots, r_D)$, where $r_d$ is the $d$-th record obtained by randomly sampling $a_n$, $d$ ranges from 1 to $D$, $D$ is the preset number of samples, and $D \le q_0$.
S3000, if $M \le q_1$, traverse $b_n$ and classify the elements of $a'_n$ corresponding to $b_{n,m}$ to obtain their number of categories $u_{n,m}$; the elements of $a'_n$ corresponding to $b_{n,m}$ are the elements of $a'_n$ located in the column of $b_{n,m}$, excluding $b_{n,m}$ itself.
For the case $M > q_1$, the processing may follow S300-S800 and is not repeated here.
S4000, traverse $U_n = (u_{n,1}, u_{n,2}, \dots, u_{n,m}, \dots, u_{n,M})$; if $u_{n,m} \le q_2$, append the category sequence of the elements of $a'_n$ corresponding to $b_{n,m}$ to a preset second category sequence $G'$, obtaining $G' = (G'_1, G'_2, \dots, G'_{sjc}, \dots, G'_{lia})$, where $G'_{sjc}$ is the $sjc$-th category sequence appended to $G'$, $sjc$ ranges from 1 to $lia$, and $lia$ is the number of category sequences appended to $G'$.
S5000, traverse $G'$ to obtain the standard deviation of the per-category element counts of $G'_{sjc}$, and append it to a preset second standard deviation set $S_0$; $S_0$ is initialized to the empty set.
S6000, perform multithreaded acquisition of the records in $a_n$ according to the categories of the elements corresponding to the field name that corresponds to $\min(S_0)$.
Specifically, this multithreaded acquisition comprises: using $shu'$ threads to acquire the records in $a_n$, each thread acquiring the records of one category of the elements corresponding to the field name that corresponds to $\min(S_0)$, where $shu'$ is the number of categories of those elements.
S2000-S6000 of this embodiment apply to the case $Q > q_0$, i.e. the case where $a_n$ contains a large number of records. In this case, the embodiment first randomly samples $a_n$ to obtain a sampled list $a'_n$. Compared with $a_n$, $a'_n$ has the same field names, but its number of records $D$ is smaller than the number of records $Q$ in $a_n$. On this basis, the embodiment classifies the elements corresponding to the field names in $a'_n$ and determines the multithreaded acquisition strategy for the records in $a_n$ from the classification result of $a'_n$. Because $a'_n$ is obtained by randomly sampling $a_n$, its classification result can be used to characterize that of $a_n$; and because $D$ is smaller than $Q$, the embodiment both balances the acquisition load across threads and reduces the time needed to determine the multithreaded acquisition strategy, which further reduces the overall acquisition time and improves data acquisition efficiency.
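The sampling step S2000 can be sketched as follows. The application does not say whether sampling is with or without replacement or where it runs; this sketch assumes sampling without replacement on an in-memory list, and the table contents, `d`, `q0`, and the `seed` parameter are made up for the example.

```python
import random


def sample_records(records, d, q0, seed=None):
    """S2000: draw D records at random from a_n, with D <= q0."""
    if d > q0:
        raise ValueError("D must not exceed q0")
    rng = random.Random(seed)
    # Sampling without replacement is an assumption of this sketch.
    return rng.sample(records, min(d, len(records)))


# Hypothetical large table a_n; only 100 of its 1000 records are classified.
table = [{"id": i, "region": "north" if i % 2 else "south"} for i in range(1000)]
sampled = sample_records(table, d=100, q0=500, seed=0)
```

The sampled list keeps the same field names as the full table, so the classification and standard deviation steps can run on it unchanged before the chosen field name is applied to the full table.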
In this embodiment, S400 further includes: if $v > q_1$, proceed to S410.
S410, obtain the weights of the preset field name types $\beta' = (\beta_1, \beta_2, \dots, \beta_e, \dots, \beta_E)$, where $\beta_e$ is the weight of the $e$-th preset field name type, $e$ ranges from 1 to $E$, and $E$ is the number of preset field name types; $0 < \beta_e < 1$ and $\sum_{e=1}^{E} \beta_e = 1$.
As a specific embodiment, the number of preset field name types is 3: the first preset field name type is the user-input field name type, with weight $\beta_1$; the second is the field name type characterizing time, with weight $\beta_2$; the third is the field name type characterizing category, with weight $\beta_3$. In this embodiment the weight of each preset field name type is an empirical value; optionally, β 123.
S420, traverse $\beta'$ to obtain the number of field names $qua(e)$ for the $e$-th preset field name type, $qua(e) = \mathrm{floor}(\beta_e \times q_1)$, where floor() rounds down.
S430, obtain the target field names $C' = (c'_1, c'_2, \dots, c'_e, \dots, c'_E)$, where $c'_e$ is the field name set corresponding to the $e$-th preset field name type, $c'_e = \{c'_{e,1}, c'_{e,2}, \dots, c'_{e,\gamma}, \dots, c'_{e,qua(e)}\}$, and $c'_{e,\gamma}$ is the $\gamma$-th field name selected at random from the field names of $a_n$ whose field name type is the $e$-th preset field name type; $\gamma$ ranges from 1 to $qua(e)$.
It should be understood that the number of field names in $C'$ for each preset field name type is positively correlated with the corresponding weight, and the total number of field names in $C'$ is less than or equal to $q_1$.
S440, traverse $C'$, classify the elements of $a_n$ corresponding to $c'_{e,\gamma}$, and transmit the records in $a_n$ in a multithreaded manner according to the classification result; the elements of $a_n$ corresponding to $c'_{e,\gamma}$ are the elements of $a_n$ located in the column of $c'_{e,\gamma}$, excluding $c'_{e,\gamma}$ itself.
In this embodiment, the method of multithreaded transmission of the records in $a_n$ according to the classification result of the elements of $a_n$ corresponding to $c'_{e,\gamma}$ is similar to S500-S800 and is not repeated here.
S410-S440 of this embodiment apply to the case where $a_n$ contains many field names belonging to the preset field name types. In this case, the embodiment splits $q_1$ according to the weight of each preset field name type, so that the number of field names to be classified for each preset field name type is positively correlated with its weight and the total number of field names to be classified across all preset field name types does not exceed $q_1$. Because the number of field names to be classified does not exceed $q_1$, the classification time is reduced, which further reduces the overall data acquisition time.
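The quota split in S420-S430 can be sketched in a few lines. The weights, the value of $q_1$, and the field names below are invented for illustration; the application only states that the weights are empirical and sum to 1.

```python
import math
import random


def field_quotas(weights, q1):
    """S420: qua(e) = floor(beta_e * q1) for each preset field name type."""
    return [math.floor(w * q1) for w in weights]


def pick_target_fields(fields_by_type, weights, q1, seed=None):
    """S430: randomly pick up to qua(e) field names from each preset type."""
    rng = random.Random(seed)
    quotas = field_quotas(weights, q1)
    return [
        rng.sample(names, min(quota, len(names)))
        for names, quota in zip(fields_by_type, quotas)
    ]


# Hypothetical weights (user-input, time, category) and candidate field names.
weights = [0.5, 0.3, 0.2]
fields_by_type = [
    ["biz_key", "tenant", "channel", "source"],
    ["create_date", "update_time"],
    ["category", "group_id", "type"],
]
quotas = field_quotas(weights, q1=7)
targets = pick_target_fields(fields_by_type, weights, q1=7, seed=0)
```

Since every quota is rounded down and the weights sum to 1, the quotas can never add up to more than $q_1$; here they add up to 6 for $q_1 = 7$.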
While certain specific embodiments of the application have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the application. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the application. The scope of the application is defined by the appended claims.

Claims (6)

1. The multithreading data acquisition method based on the multi-bin slicing is characterized in that the multi-bin comprises a target list set A, A= { a 1 ,a 2 ,…,a n ,…,a N },a n The value range of N is 1 to N for the nth target list included in the number bin, and N is the number of the target lists included in the number bin; the multithreading data acquisition method comprises the following steps:
s100, obtaining a n The number of records Q is included, if Q.ltoreq.q 0 S200 is performed; q 0 A preset recording quantity threshold value;
s200, obtaining a n Field name b n ,b n =(b n,1 ,b n,2 ,…,b n,m ,…,b n,M ),b n,m Is a as n Includes the M field name, M is in the range of 1 to M, M is a n Number of field names included;
s300, if M>q 1 Then go through b n If b n,m For a field name belonging to the preset field name type, b will be n,m Append to a preset first fieldName sequence C 1 Obtaining C 1 =(c 1,1 ,c 1,2 ,…,c 1,k ,…,c 1,v ),c 1,k Is added to C for the kth 1 The field name of (1), k has a value of 1 to v, v being appended to C 1 The number of field names of (a); c (C) 1 Is initialized to a null value; q 1 A threshold value for the number of the preset first field names;
s400, if v.ltoreq.q 1 Then enter S500;
s500, traversing C 1 For a n In c) 1,k Classifying the corresponding elements to obtain a n In c) 1,k Category number u of corresponding element k ;a n In c) 1,k The corresponding element is a n Is positioned at c 1,k In column and not including c 1,k An element therein;
s600, walk u= (U) 1 ,u 2 ,…,u k ,…,u v ) If u k ≤q 2 Will be a n In c) 1,k The class sequence of the corresponding element is added to a preset first class sequence G to obtain G= (G) 1,1 ,G 1,2 ,…,G 1,j ,…,G 1,w ),G 1,j For the j-th class sequence added to G, j has a value ranging from 1 to w, w being the number of class sequences added to G; initializing G to be a null value; q 2 A preset thread number threshold value;
s700, traversing G to obtain G 1,j Standard deviation sigma of category number j And will sigma j Initializing an empty set by adding a preset first standard deviation set S ', S';
s800, corresponding to the category pair a of the field name corresponding element according to the min (S') n The multi-thread collection is carried out on the included records, and min () is the minimum value;
wherein the process of obtaining $q_2$ comprises:
S610, obtain the maximum allowable number of thread connections $ang_{max}$ of the bin;
S620, obtain the current number of thread connections $ang_0$ of the bin;
S630, obtain the maximum allowable number of acquisition tasks $ang_1$ of the acquisition end, $ang_1 = \mathrm{floor}(nc_0 / nc_1)$, where $nc_0$ is the current free memory size of the acquisition end, $nc_1$ is the preset memory size occupied by one acquisition task, and $\mathrm{floor}()$ rounds down;
S640, obtain the number of thread connections $ang_2$ allowed by the CPU of the acquisition end; when the CPU utilization of the acquisition end falls within a given preset utilization range, $ang_2$ is the preset number of thread connections corresponding to that range;
S650, obtain $q_2$, $q_2 = \min((ang_{max} - ang_0), ang_1, ang_2)$.
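Steps S610 to S650 can be illustrated with a minimal Python sketch. The function names, the `CPU_BANDS` table and its numbers are hypothetical; the claim only states that a preset thread count exists for each preset utilization range, not what the ranges or counts are.

```python
import math

# Hypothetical preset bands: (upper bound of CPU utilization, allowed threads).
# The claim does not give concrete ranges; these values are illustrative only.
CPU_BANDS = [(0.50, 16), (0.80, 8), (1.00, 2)]


def cpu_thread_allowance(cpu_utilization, bands=CPU_BANDS):
    """S640: return ang_2 for the first preset band containing the utilization."""
    for upper, allowed in bands:
        if cpu_utilization <= upper:
            return allowed
    return 0  # assumption: outside every band, allow no threads


def thread_threshold(ang_max, ang_0, free_mem, mem_per_task, cpu_utilization):
    """S610-S650: q_2 = min(ang_max - ang_0, ang_1, ang_2)."""
    ang_1 = math.floor(free_mem / mem_per_task)    # S630
    ang_2 = cpu_thread_allowance(cpu_utilization)  # S640
    return min(ang_max - ang_0, ang_1, ang_2)      # S650
```

With 100 allowed and 90 current connections, 8000 units of free memory, 512 units per task and 60% CPU utilization, the three candidates are 10, 15 and 8, so the sketch yields a threshold of 8: the scarcest resource bounds the thread count.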
2. The multithreading data acquisition method based on multi-bin slicing of claim 1, wherein S100 further comprises: if $Q > q_0$, execute S2000;
S2000, randomly sample $a_n$ to obtain $a'_n = (r_1, r_2, \dots, r_d, \dots, r_D)$, where $r_d$ is the record obtained by the $d$-th random sampling of $a_n$, $d$ ranges from 1 to $D$, $D$ is the preset number of samplings, and $D \le q_0$;
S3000, if $M \le q_1$, traverse $b_n$ and classify the elements of $a'_n$ corresponding to $b_{n,m}$, obtaining the number of categories $u_{n,m}$ of the elements of $a'_n$ corresponding to $b_{n,m}$; the elements of $a'_n$ corresponding to $b_{n,m}$ are the elements of $a'_n$ located in the column of $b_{n,m}$, excluding $b_{n,m}$ itself;
S4000, traverse $U_n = (u_{n,1}, u_{n,2}, \dots, u_{n,m}, \dots, u_{n,M})$; if $u_{n,m} \le q_2$, append the category sequence of the elements of $a'_n$ corresponding to $b_{n,m}$ to a preset second category sequence $G'$, obtaining $G' = (G'_1, G'_2, \dots, G'_{sjc}, \dots, G'_{lia})$, where $G'_{sjc}$ is the $sjc$-th category sequence appended to $G'$, $sjc$ ranges from 1 to $lia$, and $lia$ is the number of category sequences appended to $G'$; $G'$ is initialized to empty;
S5000, traverse $G'$, obtain the standard deviation of the numbers of elements in the categories of $G'_{sjc}$, and append it to a preset second standard deviation set $S_0$; $S_0$ is initialized to the empty set;
S6000, perform multithreaded acquisition of the records included in $a_n$ according to the categories of the elements corresponding to the field name that corresponds to $\min(S_0)$.
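The sampling branch of claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: records are modeled as dictionaries, "classifying" a column is taken to mean grouping by distinct value, sampling is done without replacement, and the function name and `seed` parameter are hypothetical.

```python
import random
from collections import Counter
from statistics import pstdev


def pick_slicing_field(records, field_names, q0, q1, q2, sample_size, seed=None):
    """Choose the field whose categories are most evenly sized on a sample."""
    if len(records) <= q0 or len(field_names) > q1:
        return None  # other branches of the method apply
    rng = random.Random(seed)
    # S2000: draw D records, with D capped at q0 as the claim requires.
    sample = rng.sample(records, min(sample_size, q0))
    best_field, best_sigma = None, None
    for name in field_names:                      # S3000
        counts = Counter(row[name] for row in sample)
        if len(counts) > q2:                      # S4000: too many categories
            continue
        sigma = pstdev(list(counts.values()))     # S5000: population std dev
        if best_sigma is None or sigma < best_sigma:
            best_field, best_sigma = name, sigma
    return best_field                             # S6000 slices on this field
```

A field with one category per record (such as a unique identifier) is filtered out by the $q_2$ check, and a heavily skewed field loses to a balanced one because its per-category counts have a larger standard deviation.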
3. The multithreading data acquisition method based on multi-bin slicing of claim 1, wherein S400 further comprises: if $v > q_1$, execute S410;
S410, obtain the weights $\beta'$ of the preset field name types, $\beta' = (\beta_1, \beta_2, \dots, \beta_e, \dots, \beta_E)$, where $\beta_e$ is the weight of the $e$-th preset field name type, $e$ ranges from 1 to $E$, and $E$ is the number of preset field name types; $0 < \beta_e < 1$ and $\sum_{e=1}^{E} \beta_e = 1$;
S420, traverse $\beta'$ and obtain the number of field names $qua(e)$ corresponding to the $e$-th preset field name type, $qua(e) = \mathrm{floor}(\beta_e \times q_1)$, where $\mathrm{floor}()$ rounds down;
S430, obtain the target field names $C'$, $C' = (c'_1, c'_2, \dots, c'_e, \dots, c'_E)$, where $c'_e$ is the field name set corresponding to the $e$-th preset field name type, $c'_e = \{c'_{e,1}, c'_{e,2}, \dots, c'_{e,\gamma}, \dots, c'_{e,qua(e)}\}$, $c'_{e,\gamma}$ is the $\gamma$-th field name randomly selected from the field names of $a_n$ whose field name type is the $e$-th preset field name type, and $\gamma$ ranges from 1 to $qua(e)$;
S440, traverse $C'$, classify the elements of $a_n$ corresponding to $c'_{e,\gamma}$, and transmit the records included in $a_n$ in a multithreaded manner according to the classification result; the elements of $a_n$ corresponding to $c'_{e,\gamma}$ are the elements of $a_n$ located in the column of $c'_{e,\gamma}$, excluding $c'_{e,\gamma}$ itself.
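The quota rule of S410 to S430 can be sketched in Python as below. The function name, the dictionary layout and the `seed` parameter are hypothetical, and the guard that caps the quota at the number of available field names is an addition the claim does not mention.

```python
import math
import random


def select_target_fields(fields_by_type, weights, q1, seed=None):
    """Pick floor(beta_e * q1) random field names for each preset type."""
    rng = random.Random(seed)
    selected = {}
    for type_name, beta in weights.items():          # S420: traverse the weights
        quota = math.floor(beta * q1)                # qua(e) = floor(beta_e * q1)
        candidates = fields_by_type.get(type_name, [])
        # Guard (assumption, not in the claim): never request more names than exist.
        take = min(quota, len(candidates))
        selected[type_name] = rng.sample(candidates, take)  # S430
    return selected
```

Because each quota is rounded down and the weights sum to 1, the total number of selected field names never exceeds $q_1$, which is what lets the later steps proceed as if $v \le q_1$ held.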
4. The multithreading data acquisition method based on multi-bin slicing of claim 1, wherein $G_{1,j} = (g^{1}_{1,j}, g^{2}_{1,j}, \dots, g^{i}_{1,j}, \dots, g^{x}_{1,j})$, where $g^{i}_{1,j}$ is the $i$-th category included in $G_{1,j}$, $i$ ranges from 1 to $x$, and $x$ is the number of categories included in $G_{1,j}$; $\sigma_j = \left( \left( \sum_{i=1}^{x} (num_{i,j} - num_j)^2 \right) / x \right)^{0.5}$, where $num_{i,j}$ is the number of elements whose category is $g^{i}_{1,j}$ among the elements of $a_n$ corresponding to $c_{1,k}$, and $num_j$ is the average number of elements per category among the elements of $a_n$ corresponding to $c_{1,k}$.
5. The multithreading data acquisition method based on multi-bin slicing of claim 4, wherein $num_j = \left( \sum_{i=1}^{x} num_{i,j} \right) / x$.
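The formulas of claims 4 and 5 amount to the population standard deviation of the per-category element counts. A small Python sketch, assuming that categories are the distinct values of a column (the helper names are hypothetical), shows the computation and the minimum selection used in S800:

```python
from collections import Counter
from math import sqrt


def category_sigma(values):
    """sigma_j: population std dev of the element counts per category."""
    counts = list(Counter(values).values())  # num_{i,j} for each category i
    x = len(counts)                          # number of categories
    mean = sum(counts) / x                   # num_j (claim 5)
    return sqrt(sum((c - mean) ** 2 for c in counts) / x)  # claim 4


def most_balanced_field(columns):
    """Return the field name whose sigma is smallest, as min(S') would."""
    return min(columns, key=lambda name: category_sigma(columns[name]))
```

For a column holding three records of one category and one of another, the counts are 3 and 1, the mean is 2, and $\sigma_j = 1$; a column split two and two gives $\sigma_j = 0$ and is therefore preferred, since its slices carry equal load.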
6. The multithreading data acquisition method based on multi-bin slicing of claim 1, wherein S800 comprises: using $shu$ threads to acquire the records included in $a_n$, each thread acquiring the records corresponding to one category of the elements corresponding to the field name that corresponds to $\min(S')$, where $shu$ is the number of categories of the elements corresponding to the field name that corresponds to $\min(S')$.
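The one-thread-per-category scheme of claim 6 might look like the following Python sketch. The `fetch` callable is a hypothetical stand-in for whatever actually pulls one slice from the bin (for example, a filtered query); here the records are already in memory purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def collect_by_category(records, field, fetch):
    """Run one worker thread per category of the chosen field."""
    groups = defaultdict(list)
    for row in records:
        groups[row[field]].append(row)   # slice the records by category
    if not groups:
        return {}                        # nothing to collect
    shu = len(groups)                    # number of categories = thread count
    with ThreadPoolExecutor(max_workers=shu) as pool:
        futures = {cat: pool.submit(fetch, rows) for cat, rows in groups.items()}
        # result() blocks until the worker for that category has finished.
        return {cat: fut.result() for cat, fut in futures.items()}
```

Since the chosen field satisfies $u_k \le q_2$, the pool size $shu$ stays within the thread-count threshold derived from the bin and the acquisition end.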
CN202311130962.4A 2023-09-04 2023-09-04 Multithreading data acquisition method based on multi-bin slicing Active CN116860462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311130962.4A CN116860462B (en) 2023-09-04 2023-09-04 Multithreading data acquisition method based on multi-bin slicing

Publications (2)

Publication Number Publication Date
CN116860462A (en) 2023-10-10
CN116860462B (en) 2023-11-17

Family

ID=88230819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311130962.4A Active CN116860462B (en) 2023-09-04 2023-09-04 Multithreading data acquisition method based on multi-bin slicing

Country Status (1)

Country Link
CN (1) CN116860462B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597162A (en) * 2020-12-25 2021-04-02 平安银行股份有限公司 Data set acquisition method, system, device and storage medium
CN113961316A (en) * 2021-11-09 2022-01-21 山东志盈医学科技有限公司 Digital slice scanning method and device based on multiple threads
CN115098336A (en) * 2022-07-21 2022-09-23 中国平安财产保险股份有限公司 Method, system, equipment and storage medium for monitoring warehouse tasks
CN115756828A (en) * 2022-10-26 2023-03-07 中国建设银行股份有限公司上海市分行 Multithreading data file processing method, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552574A (en) * 2019-09-25 2020-08-18 华为技术有限公司 Multithreading synchronization method and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
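Design and Implementation of a Data Analysis System Based on a Thread Pool; Zhan Xinlin; Wang Gongting; Xu Xiaozhong; Microcomputer Information (No. 33); full text *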

Also Published As

Publication number Publication date
CN116860462A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
US6104835A (en) Automatic knowledge database generation for classifying objects and systems therefor
US8392415B2 (en) Clustering of content items
CN112132852B (en) Automatic image matting method and device based on multi-background color statistics
CN102135979A (en) Data cleaning method and device
US20100272351A1 (en) Information processing apparatus and method for detecting object in image data
US7324694B2 (en) Fluid sample analysis using class weights
CN116860462B (en) Multithreading data acquisition method based on multi-bin slicing
CN116805511A (en) Single cell transcriptome cell debris and multicellular filtration method, medium and equipment
US7308137B2 (en) Method of determining color composition of an image
CN115510331B (en) Shared resource matching method based on idle amount aggregation
CN111400597A (en) Information classification method based on k-means algorithm and related equipment
CN110852443B (en) Feature stability detection method, device and computer readable medium
CN116881014B (en) Processing method for multi-thread data acquisition
CN113888318A (en) Risk detection method and system
CN116841756B (en) Acquisition method of target incremental data
CN110795473A (en) Bootstrap-method-based accelerated search method
CN113068067B (en) Account recalling method and device
CN111367820B (en) Sequencing method and device for test cases
CN114726610B (en) Method and device for detecting attack of automatic network data acquirer
CN116152538A (en) Method for searching threshold value by image classification
CN114282525A (en) Text classification method, system and computer equipment based on improved TF-IDF
CN114299491A (en) Cell image acquisition method, device, equipment and storage medium
CN117520043A (en) Quality evaluation method, equipment and storage medium for Nand false
CN114372190A (en) Internet mass data retrieval method and retrieval system
CN116010390A (en) Database processing method, device and medium based on time sequence

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant