CN110471955B

CN110471955B - Relation calculation method and device, computer storage medium and terminal

Info

Publication number: CN110471955B
Application number: CN201910695579.0A
Authority: CN
Inventors: 周广一; 梁秀钦; 白硕
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2022-09-09
Anticipated expiration: 2039-07-30
Also published as: CN110471955A

Abstract

A method, a device, a computer storage medium and a terminal for relation calculation are disclosed. The method comprises: splitting each source data subject to relation calculation into two or more data blocks according to a preset strategy; performing relation calculation between the split data blocks to obtain first relation calculation results; and combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data. The embodiment of the invention reduces the complexity of the relation calculation.

Description

Relation calculation method and device, computer storage medium and terminal

Technical Field

The present disclosure relates to, but is not limited to, data processing technologies, and in particular to a method, an apparatus, a computer storage medium, and a terminal for relation calculation.

Background

In relation calculation, the data volume of each event source is huge, so performing the association calculation directly places a heavy burden on the computer and severely tests its computing capacity. In particular, when fusion calculation is performed over a plurality of event sources, the complexity of the relation calculation increases exponentially each time an event source is added.

To facilitate understanding of the relation calculation, the definitions involved are briefly described below:

1. Implicit relationship: a relationship extracted from event data by analysis, mining and reasoning is called an implicit relationship.

2. Entity: an entity is the individual that generates an event and is the subject of the event; it is not limited to a particular category. For example, two devices at the same place collect information data simultaneously. The first device mainly collects mobile phone number information, together with other accessory information. The second device mainly collects mobile hotspot (WIFI) signal information. The subjects of the two events therefore differ: a mobile phone number (international mobile subscriber identity, IMSI) and a WIFI device (media access control address, MAC).

3. Entity information: the detailed information of an entity. For example, a collected WIFI (MAC) signal event includes the MAC address, location code, occurrence time, area code, flag bit and the like.

4. Relation rules: relationships between entities are divided into explicit relationships and implicit relationships. An explicit relationship exists objectively and can be judged directly from established facts; for example, whether a WIFI event and a mobile phone number event occurred at the same place can be judged directly by reading the data. An implicit relationship cannot be obtained by simply reading information; it requires a calculation rule or algorithm that performs statistics and calculations on historical data to determine whether an implicit relationship exists between two entities. Such a relationship is a possibility: two entities that satisfy a certain rule are related only with high probability.

5. Event data: for example, mobile phone number (IMSI) information data collected by base stations in the public security field, WIFI (MAC) information data collected at important places, and license plate number information data obtained after a vehicle passes a traffic gate.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiment of the invention provides a method and a device for calculating a relationship, a computer storage medium and a terminal, which can reduce the complexity of the relationship calculation.

The embodiment of the invention provides a method for calculating a relationship, which comprises the following steps:

splitting each source data subjected to relational computation into more than two data blocks according to a preset strategy;

carrying out relation calculation among the data blocks on the split data blocks to obtain a first relation calculation result;

and combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data.

In an exemplary embodiment, the splitting into two or more data blocks according to the preset policy includes:

splitting each source data into more than two data blocks according to a preset first unit time length; or,

determining, according to a received first external instruction, the number of data blocks into which each source data is to be split; splitting each source data into the determined number of data blocks through a preset sampling (sample) operator;

and the data blocks do not have intersection, and the sum of the start-stop duration of all the data blocks is equal to the start-stop duration of the source data.

In an exemplary embodiment, the performing the relation calculation between the data blocks includes:

sorting the data blocks of the split source data according to the time sequence;

traversing the data blocks at each sorting position of each source data: performing relation calculation between the data block at the current sorting position of the current source data and, respectively, the data blocks at the current sorting position, the previous position and the next position of the other source data.

In an exemplary embodiment, the splitting into two or more data blocks according to a preset splitting rule includes:

splitting the source data into more than two data blocks according to a preset second unit time length; or,

determining, according to a received second external instruction, the number of data blocks into which the source data is to be split; splitting the source data into the determined number of data blocks through a preset sampling (sample) operator;

and an intersection of preset lengths exists between adjacent data blocks.

sequencing the split data blocks according to the time sequence;

traversing the data blocks at each sequencing position of each source data: and respectively carrying out relation calculation on the data block at the current sorting position of the current source data and the data blocks at the current sorting positions of other source data.

In another aspect, an embodiment of the present invention further provides a device for relation calculation, including: a splitting unit, a calculating unit and a merging unit; wherein,

the splitting unit is used for: dividing each source data subjected to relational computation into more than two data blocks according to a preset strategy;

the computing unit is to: carrying out relation calculation among the data blocks on the split data blocks to obtain a first relation calculation result;

the merging unit is used for: and combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data.

In an exemplary embodiment, the splitting unit comprises a first splitting module for:

determining, according to a received first external instruction, the number of data blocks into which each source data is to be split; splitting each source data into the determined number of data blocks through a preset sampling (sample) operator;

In an exemplary embodiment, the computing unit includes a first computing module to:

traversing the data blocks at each sorting position of each source data: performing relation calculation between the data block at the current sorting position of the current source data and, respectively, the data blocks at the current sorting position, the previous position and the next position of the other source data.

In an exemplary embodiment, the splitting unit comprises a second splitting module for:

and an intersection of preset lengths exists between the adjacent data blocks.

In an exemplary embodiment, the computing unit includes a second computing module to:

sequencing the split data blocks according to the time sequence;

In still another aspect, an embodiment of the present invention further provides a computer storage medium, where computer-executable instructions are stored in the computer storage medium, and the computer-executable instructions are used to execute the method for calculating the relationship.

In another aspect, an embodiment of the present invention further provides a terminal, including: a memory and a processor; wherein,

the processor is configured to execute program instructions in the memory;

the program instructions, when read by the processor, perform the following operations:

dividing each source data subjected to relational computation into more than two data blocks according to a preset strategy;

Compared with the related art, the technical scheme of the application comprises the following steps: dividing each source data subjected to relational computation into more than two data blocks according to a preset strategy; carrying out relation calculation among the data blocks on the split data blocks to obtain a first relation calculation result; and combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data. The embodiment of the invention reduces the complexity of the relation calculation.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flow chart of a method of relational computation according to an embodiment of the invention;

FIG. 2 is a block diagram of an apparatus for relational computation according to an embodiment of the present invention;

FIG. 3 is a schematic logic diagram of relation calculation in the related art;

FIG. 4 is a diagram illustrating partitioning of source data into data blocks according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating exemplary data inter-block relationship calculations in accordance with the present invention;

FIG. 6 is a diagram illustrating the calculation of the relationship between data blocks according to another exemplary embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

Fig. 1 is a flowchart of a method for calculating a relationship according to an embodiment of the present invention, as shown in fig. 1, including:

step 101, splitting each source data subject to relation calculation into more than two data blocks according to a preset strategy;

step 102, performing relation calculation between the split data blocks to obtain a first relation calculation result;

step 103, combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data.

It should be noted that the calculation method of the relationship of data between data blocks is the same as that of the relationship of data between source data; in addition, the calculation of the first relation calculation result includes the integration and deduplication processes that have been known in the related art.
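Steps 101 to 103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `(entity, place, timestamp_in_seconds)`, the one-hour block length, the 180-second threshold and all function names are hypothetical choices made for the example. It pairs only blocks of the same hour, so pairs that straddle an hour boundary are not found by this simplest variant.

```python
from collections import defaultdict

def split_by_hour(events):
    """Step 101: group (entity, place, ts) events into hourly blocks."""
    blocks = defaultdict(list)
    for entity, place, ts in events:
        blocks[ts // 3600].append((entity, place, ts))
    return blocks

def relate(block_a, block_b, max_gap=180):
    """Step 102: pair events at the same place within max_gap seconds."""
    return {(a, b)
            for a, place_a, ts_a in block_a
            for b, place_b, ts_b in block_b
            if place_a == place_b and abs(ts_a - ts_b) <= max_gap}

def relation_calculation(source_a, source_b):
    """Steps 101-103 for two sources; a set union merges and deduplicates."""
    blocks_a, blocks_b = split_by_hour(source_a), split_by_hour(source_b)
    result = set()
    for hour, block_a in blocks_a.items():
        result |= relate(block_a, blocks_b.get(hour, []))  # step 103
    return result
```

The same `relate` function would be applied to whole sources in the unsplit case, which mirrors the note that the calculation between data blocks is the same as the calculation between source data.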

In an exemplary embodiment, splitting into two or more data blocks according to a preset policy includes:

and the data blocks do not intersect, and the sum of the start-stop durations of all the data blocks is equal to the start-stop duration of the source data. In other words, the union of the data blocks split from a source data is equal to that source data.
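A non-intersecting split by unit time length might look like the sketch below. The function name, the `(entity, timestamp)` record layout and the half-open window convention `[start, end)` are assumptions made for illustration only.

```python
def split_disjoint(events, start, end, unit):
    """Split events with start <= ts < end into consecutive, disjoint
    windows of `unit` seconds; the timestamp is the last field of each event."""
    n_blocks = -(-(end - start) // unit)  # ceiling division
    blocks = [[] for _ in range(n_blocks)]
    for event in events:
        ts = event[-1]
        blocks[(ts - start) // unit].append(event)
    return blocks
```

Because every event lands in exactly one window, the block sizes add up to the size of the source data, which is the property stated above.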

It should be noted that the sample operator may be implemented by means of the Spark computing engine (a data processing engine well known to those skilled in the art).

Correspondingly, the calculation of the relationship between the data blocks comprises the following steps:

In an exemplary embodiment, the splitting into two or more data blocks according to the preset splitting rule respectively includes:

determining the number of blocks of the split data blocks of the source data according to the received second external instruction; splitting source data into data blocks with determined block numbers through a preset sampling operator;

and an intersection of preset lengths exists between the adjacent data blocks.

It should be noted that the sample operator may be implemented by means of the Spark computing engine (a data processing engine well known to those skilled in the art).

It should be noted that the first unit time length and the second unit time length may be equal and may be preset durations, for example one hour or two hours; the source data may also be split into several parts of equal duration. The preset length of the intersection may be set by a person skilled in the art according to parameters of the relation calculation. For example, if a person skilled in the art considers that source data of events occurring within a certain time period may have an implicit relationship, that time period may be used as the preset length, for example three minutes.
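A split in which adjacent data blocks share an intersection of preset length can be sketched as follows. This assumes, purely for illustration, that each block extends past its nominal end by `overlap` seconds; the function name and record layout are hypothetical.

```python
def split_with_overlap(events, start, end, unit, overlap):
    """Block i covers [start + i*unit, start + (i+1)*unit + overlap), so
    adjacent blocks share `overlap` seconds; timestamp is the last field."""
    blocks = []
    lo = start
    while lo < end:
        hi = lo + unit + overlap
        blocks.append([e for e in events if lo <= e[-1] < hi])
        lo += unit
    return blocks
```

With `unit=3600` and `overlap=180`, two records no more than three minutes apart always appear together in the block that contains the earlier record, so comparing blocks at the same sorting position is enough to find them.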

sequencing the split data blocks according to the time sequence;

Compared with the related art, the technical scheme of the application comprises the following steps: splitting each source data subjected to relational computation into more than two data blocks according to a preset strategy; carrying out relation calculation among the data blocks on the split data blocks to obtain a first relation calculation result; and combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data. The embodiment of the invention reduces the complexity of the relation calculation.

Fig. 2 is a block diagram of a relation calculation apparatus according to an embodiment of the present invention, as shown in fig. 2, including: a splitting unit, a calculating unit and a merging unit; wherein,

splitting each source data into more than two data blocks according to a preset first unit time length; or,

and an intersection of preset lengths exists between the adjacent data blocks.

sequencing the split data blocks according to the time sequence;

The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are used for executing the method for calculating the relation.

An embodiment of the present invention further provides a terminal, including: a memory and a processor; wherein,

the processor is configured to execute program instructions in the memory;

the program instructions, when read by the processor, perform the following operations:

The method of the embodiment of the present invention is clearly and specifically explained by the application examples, which are only used for illustrating the present invention and are not used for limiting the protection scope of the present invention.

Application example

This application example addresses the high complexity of relation calculation in multi-source fusion association calculation; it can reduce the execution time of the relation calculation and improve data processing performance.

Taking implicit relationships as an example, one type of implicit relationship calculation computes relationships among multiple event sources. For example, at a plurality of places and over a plurality of time periods, three devices respectively collect mobile phone number (IMSI) information data, WIFI (MAC) information data and license plate number information data. The three devices thus collect the source data of three events, and a person skilled in the art may consider that the source data of these events may have an implicit relationship when they occur at the same place within a certain time interval. Fusing the data of a plurality of event sources has high computational complexity and takes a long time. Fig. 3 is a schematic logic diagram of relation calculation in the related art, where the date in fig. 3 may include specific year, month and day information, and the time includes specific time-of-day information. As shown in fig. 3, performing relation calculation directly on the source data of multiple events is complex. To calculate the implicit relationships among a plurality of data sources, the related art joins (Join) the tables and then screens the results with a filtering condition. For example, each of the three source data is traversed, and each record is associated with each record of the other source data; if two records appear at the same place and the interval between their occurrence times does not exceed three minutes (the preset duration), it can be preliminarily determined that the two records are related and are possibly data signals sent by the same user. The result of this processing is a Cartesian product: the amount of data to be calculated is very large, and when cluster resources are lacking the calculation time is very long and the calculation complexity increases exponentially.
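For comparison, the related-art approach of joining whole tables and then filtering can be sketched as a full Cartesian product. The record layout `(entity, place, timestamp_in_seconds)`, the function name and the returned comparison counter are assumptions added for illustration.

```python
from itertools import product

def naive_relation(source_a, source_b, max_gap=180):
    """Compare every record of source_a with every record of source_b,
    then keep pairs at the same place within max_gap seconds."""
    comparisons = 0
    related = set()
    for (a, place_a, ts_a), (b, place_b, ts_b) in product(source_a, source_b):
        comparisons += 1
        if place_a == place_b and abs(ts_a - ts_b) <= max_gap:
            related.add((a, b))
    return related, comparisons
```

The counter makes the cost visible: two sources of sizes m and n always require m × n comparisons, regardless of how few pairs pass the filter.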

The application example of the invention splits the source data into a plurality of data blocks (which may also be called small tables), and there are several ways to do so. In the first scheme, a person skilled in the art determines the number of data blocks into which the source data is split; for example, the source data is divided into the specified number of small tables by sample operator sampling using the Spark computing engine. In the second scheme, the source data is divided according to a set unit time length; for example, one day of source data is divided by hour into 24 small tables, with the data in each hour forming one small table. When association is performed, only data within the same hour needs to be associated directly, and the results are finally merged. Fig. 4 is a schematic diagram of splitting source data into data blocks according to an embodiment of the present invention. As shown in fig. 4, the source data is assumed to be divided into a plurality of data blocks by year, month, day and hour; that is, source data with the same year, month, day and hour falls into the same data block, and source data with a different hour falls into a different data block. This is equivalent to dividing the source data into a number of small tables named year-month-day-hour. The application example of the present invention can use partitions in big data structured storage technology; on the one hand this satisfies the requirement of the second scheme, and on the other hand partitioning also accelerates data retrieval. In an optional application example, the condition information for splitting data blocks may be set with reference to the related art, and the data blocks are split according to the set condition information after the source data is read.
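The second scheme, small tables named year-month-day-hour, can be illustrated with an in-memory sketch. The key format `%Y-%m-%d-%H`, the function name and the record layout `(entity, place, datetime)` are assumptions; a real deployment would more likely rely on the partitioning feature of the storage system.

```python
from collections import defaultdict
from datetime import datetime

def partition_by_hour(records):
    """Group (entity, place, when) records into small tables keyed by
    a year-month-day-hour string derived from the datetime `when`."""
    tables = defaultdict(list)
    for entity, place, when in records:
        tables[when.strftime("%Y-%m-%d-%H")].append((entity, place, when))
    return dict(tables)
```

One day of data yields at most 24 keys, matching the 24 small tables described above.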

After the application example of the invention completes the splitting of the data block, the relation calculation is carried out on the split data block; it is assumed that source data 1 and source data 2 are respectively split into 24 data blocks; the split data blocks are sorted according to the time sequence; calculating the relation of the data blocks according to the sequence; the method of relational computation may be different depending on the method of splitting the data blocks.

Example 1: splitting each source data into more than two data blocks according to a preset first unit time length; or, determining, according to a received first external instruction, the number of data blocks into which each source data is to be split, and splitting each source data into the determined number of data blocks through a preset sampling (sample) operator. The data blocks do not intersect, and the sum of the start-stop durations of all the data blocks is equal to the start-stop duration of the source data. In other words, the union of the data blocks split from a source data is equal to that source data.

For the data blocks split in the above manner, the method of relation calculation includes: sorting the split data blocks of each source data in time order; traversing the data blocks at each sorting position of each source data: performing relation calculation between the data block at the current sorting position of the current source data and, respectively, the data blocks at the current sorting position, the previous position and the next position of the other source data.
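For non-intersecting blocks, comparing each block with the previous, current and next block of the other source can be sketched as below. The helper `same_place_within`, the record layout and the 180-second threshold are hypothetical; the sketch also assumes the threshold is no longer than one block, so a related pair never lies more than one position apart.

```python
def same_place_within(block_a, block_b, max_gap=180):
    """Pairs of events at the same place within max_gap seconds."""
    return {(a, b)
            for a, place_a, ts_a in block_a
            for b, place_b, ts_b in block_b
            if place_a == place_b and abs(ts_a - ts_b) <= max_gap}

def relate_with_neighbours(blocks_a, blocks_b, relate=same_place_within):
    """For each time-ordered block of source A, compare it with the
    previous, current and next block of source B."""
    results = []
    for i, block_a in enumerate(blocks_a):
        found = set()
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(blocks_b):
                found |= relate(block_a, blocks_b[j])
        results.append(found)
    return results
```

Looking at the neighbouring positions is what recovers pairs that straddle a block boundary, such as one record at 00:59:50 and another at 01:00:50.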

In an exemplary application example, the following relationship calculation may also be performed for the data block split according to the above example: sequencing the split data blocks according to the time sequence; traversing the data blocks at each sequencing position of each source data: and respectively carrying out relation calculation on the data block at the current sorting position of the current source data and the data blocks at the current sorting positions of other source data.

Example 2: splitting the source data into more than two data blocks according to a preset second unit time length; or, determining, according to a received second external instruction, the number of data blocks into which the source data is to be split, and splitting the source data into the determined number of data blocks through a preset sampling (sample) operator. An intersection of a preset length exists between adjacent data blocks.

For the data blocks split in the above manner, the method for calculating the relationship includes: sequencing the split data blocks according to the time sequence; traversing the data blocks at each sequencing position of each source data: and respectively carrying out relation calculation on the data block at the current sorting position of the current source data and the data blocks at the current sorting positions of other source data.
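When adjacent blocks already share an intersection, only blocks at the same sorting position need to be compared. The sketch below uses hypothetical names and the same illustrative record layout `(entity, place, timestamp_in_seconds)`; it also shows why merging should deduplicate, since a pair lying inside the shared interval is found in two blocks.

```python
def same_place_within(block_a, block_b, max_gap=180):
    """Pairs of events at the same place within max_gap seconds."""
    return {(a, b)
            for a, place_a, ts_a in block_a
            for b, place_b, ts_b in block_b
            if place_a == place_b and abs(ts_a - ts_b) <= max_gap}

def relate_same_position(blocks_a, blocks_b, relate=same_place_within):
    """Compare only the blocks that share a sorting position."""
    return [relate(a, b) for a, b in zip(blocks_a, blocks_b)]

def merge(results):
    """Union of all per-block results; duplicates collapse in the set."""
    return set().union(*results)
```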

The relation calculation result of each split data block may be stored in a temporary file; when the source data is split into 24 data blocks, the relation calculation results comprise 24 temporary files, and merging the 24 temporary files yields the relation calculation result of the source data. The application example of the invention finds the relationships between the data of two source data according to business requirements. Because two devices at the same place collect data at the same time, the implicit relationship is judged as follows: if two records are from the same place and their time difference is no more than three minutes, the two records may be related. In the related art, the two source data are compared directly record by record, so the complexity increases exponentially and the time and resource consumption is huge. After the application example of the invention splits the source data into data blocks, the association operation is performed on smaller data volumes, which reduces the load on the cluster and accelerates data processing. Finally, the calculation results of the data blocks are combined, which meets the business requirement of the relation calculation. Fig. 5 is a schematic diagram of relation calculation between data blocks according to an application example of the present invention. As shown in fig. 5, taking splitting into 24 data blocks as an example, data block 1 of source data 1 and data block 1 of source data 2 undergo relation calculation to obtain first relation calculation result 1; data block 2 of source data 1 and data block 2 of source data 2 undergo relation calculation to obtain first relation calculation result 2; data block 3 of source data 1 and data block 3 of source data 2 undergo relation calculation to obtain first relation calculation result 3; data block 4 of source data 1 and data block 4 of source data 2 undergo relation calculation to obtain first relation calculation result 4; and so on until the relation calculation of all the data blocks is completed.
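Storing each first relation calculation result in a temporary file and merging the files afterwards might be sketched as follows. The JSON format, the `result_NN.json` naming pattern and the function names are assumptions chosen for the example; the source does not specify a file format.

```python
import json
import tempfile
from pathlib import Path

def write_block_result(directory, index, pairs):
    """Store one first relation calculation result as a JSON file."""
    path = Path(directory) / f"result_{index:02d}.json"
    path.write_text(json.dumps(sorted(pairs)))
    return path

def merge_results(directory):
    """Read every per-block file and combine them into one result set."""
    merged = set()
    for path in sorted(Path(directory).glob("result_*.json")):
        merged.update(tuple(pair) for pair in json.loads(path.read_text()))
    return merged

# Usage: two block results written to a temporary directory, then merged.
with tempfile.TemporaryDirectory() as tmp:
    write_block_result(tmp, 0, {("imsi-1", "mac-1")})
    write_block_result(tmp, 1, {("imsi-1", "mac-1"), ("imsi-2", "mac-2")})
    combined = merge_results(tmp)
```

Collecting the pairs in a set removes pairs that were written by more than one block.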

FIG. 6 is a schematic diagram of relation calculation between data blocks according to another application example of the present invention. As shown in FIG. 6, the data blocks split from source data 1 are stored in a first partition, the data blocks split from source data 2 are stored in a second partition, and the data blocks split from source data 3 are stored in a third partition. The data blocks at the corresponding sorting position are read from each partition according to the rule of data block relation calculation, and relation calculation is performed on them: the relation calculation result of the first-ranked data blocks is stored as first relation calculation result 1, the relation calculation result of the second-ranked data blocks is stored as first relation calculation result 2, and so on until the relation calculation of all the data blocks is completed. After all first relation calculation results are combined, the relation calculation result of source data 1, 2 and 3 is obtained. In the merging process, deduplication may be performed with reference to the related art.
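The three-partition flow can be sketched as below, assuming (as an illustrative simplification) that relation calculation among three sources is done pairwise at each sorting position. The function names, partition names and record layout `(entity, place, timestamp_in_seconds)` are hypothetical.

```python
from itertools import combinations

def relate(block_a, block_b, max_gap=180):
    """Pairs of events at the same place within max_gap seconds."""
    return {(a, b)
            for a, place_a, ts_a in block_a
            for b, place_b, ts_b in block_b
            if place_a == place_b and abs(ts_a - ts_b) <= max_gap}

def fuse_sources(partitions):
    """partitions maps a source name to its time-ordered list of blocks;
    all lists are assumed to have the same length."""
    n_blocks = len(next(iter(partitions.values())))
    merged = set()
    for position in range(n_blocks):
        # Read the block at this position from every pair of partitions.
        for blocks_a, blocks_b in combinations(partitions.values(), 2):
            merged |= relate(blocks_a[position], blocks_b[position])
    return merged
```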

It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing associated hardware (e.g., a processor), and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware, for example, by an integrated circuit to implement its corresponding function, or in software, for example, by a processor executing a program/instruction stored in a memory to implement its corresponding function. The present invention is not limited to any specific form of combination of hardware and software.

Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method of relational computation, comprising:

splitting each source data subjected to relational computation into more than two data blocks according to a preset strategy; wherein, there is no intersection between the data blocks, and the sum of the start-stop duration of all the data blocks is equal to the start-stop duration of the source data;

combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data;

the calculating the relationship among the data blocks comprises:

2. The method of claim 1, wherein the splitting into two or more data blocks according to the preset policy comprises:

3. The method of claim 1, wherein the splitting into two or more data blocks according to the preset policy comprises:

and an intersection of preset lengths exists between the adjacent data blocks.

4. The method according to claim 1 or 3, wherein the performing of the relation calculation between the data blocks comprises:

sorting the split data blocks according to the time sequence;

5. An apparatus for relational computation, comprising: a splitting unit, a calculating unit and a merging unit; wherein,

the splitting unit is used for: dividing each source data subjected to relational computation into more than two data blocks according to a preset strategy; wherein, there is no intersection between the data blocks, and the sum of the start-stop duration of all the data blocks is equal to the start-stop duration of the source data;

the merging unit is used for: combining all the obtained first relation calculation results to obtain a second relation calculation result of the source data;

the computing unit comprises a first computing module configured to:

6. The apparatus of claim 5, wherein the splitting unit comprises a first splitting module configured to:

determining the number of blocks of a data block to be split of each source data according to a received first external instruction; splitting each source data into data blocks with determined block numbers through a preset sampling operator;

7. The apparatus of claim 5, wherein the splitting unit comprises a second splitting module configured to:

and an intersection of preset lengths exists between the adjacent data blocks.

8. The apparatus according to claim 5 or 7, wherein the computing unit comprises a second computing module configured to:

sequencing the split data blocks according to the time sequence;

9. A computer storage medium having stored therein computer-executable instructions for performing the method of relationship calculation of any one of claims 1 to 4.

10. A terminal, comprising: a memory and a processor; wherein,

the processor is configured to execute program instructions in the memory;

dividing each source data subjected to relational computation into more than two data blocks according to a preset strategy; wherein, there is no intersection between the data blocks, and the sum of the start-stop duration of all the data blocks is equal to the start-stop duration of the source data;

the program instructions, when read by the processor, perform in particular the following operations: