WO2012124995A2 - Method and system for maintaining vector clocks during synchronization for data race detection - Google Patents

Method and system for maintaining vector clocks during synchronization for data race detection

Info

Publication number: WO2012124995A2
Authority: WO (WIPO PCT)
Application number: PCT/KR2012/001880
Prior art keywords: thread, version, clock, vector, threads
Other languages: French (fr)
Other versions: WO2012124995A3 (en)
Inventors: Parikshit KOLIPAKA, Rahul Nagpal
Original Assignee: Samsung Electronics Co., Ltd.
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2012124995A2 (en)
Publication of WO2012124995A3 (en)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/52: Program synchronisation; Mutual exclusion, e.g. by means of semaphores


Abstract

Method and system for maintaining vector clocks during synchronization for data race detection. Embodiments herein disclose methods to reduce the overheads of maintaining and updating vector clocks during synchronization in vector clock based dynamic data race detection systems. Embodiments herein improve vector clock based dynamic data race detection systems orthogonally, without compromising the precision of the system, by using opportunistic methods to reduce overheads during synchronization of threads.

Description

METHOD AND SYSTEM FOR MAINTAINING VECTOR CLOCKS DURING SYNCHRONIZATION FOR DATA RACE DETECTION
Embodiments herein relate to dynamic data race detection, and, more particularly, to reducing overheads in maintaining and updating vector clocks during synchronization.
A vector clock based dynamic data race detector provides a general dynamic analysis framework, based on the vector clock mechanism, for detecting data races in concurrent programs at run time more precisely (with fewer false positives) than static approaches or other dynamic approaches, such as the lockset based approach. The major issue with a dynamic data race detector is the space and time overhead of maintaining and updating vector clocks, which is O(n) in general, where n is the number of threads. The increasing number of cores on a chip, and the high degree of threading supported by those cores and by GPGPUs, further exacerbate the performance and space overheads associated with vector clocks.
A data race condition occurs when two threads access the same memory location at the same time without synchronization (that is, the accesses are not ordered by happens-before) and at least one of these memory accesses is a write access. Race conditions are inherently difficult to detect, reproduce, and eliminate, primarily because they occur only in certain rare executions and rare contexts. The major trade-off between static and dynamic data race detectors is that of soundness vs. precision. A static data race detector does not actually run the program and never misses a data race if one exists in the program (soundness), at the cost of being conservative and producing many false positives (less precise). Dynamic data race detectors, in contrast, actually run the program and thereby gain in precision. A dynamic data race detector (DDRD) performs and scales better than a static race detector, which bears the inherent curse of the algorithmic overheads involved in deep program analysis. However, any overhead due to a DDRD directly impacts the program execution time, and not merely the compilation or analysis time as in static techniques. The need to improve the performance overheads of DDRD is further fueled by the growing popularity of multi-core architectures and GPGPUs. Current state-of-the-art tools trade accuracy (precision) for speed.
The lockset based approach to DDRD is limited to detecting races in programs that use the most popular synchronization primitive, i.e., the locking discipline. The lockset based approach assumes that all variables are guarded by all locks at the beginning of program execution. As it processes the trace of the program, it iteratively refines the set of locks associated with each shared variable, and whenever it finds that a variable is not guarded by a proper lock, it raises an alarm for a possible data race. Being limited to the locking discipline, this approach leads to many false positives in the presence of other synchronization primitives such as fork-join and wait-notify, among others.
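For concreteness, the lockset refinement just described may be sketched as follows. This is a minimal illustration only; the class and method names (LocksetChecker, onAccess) and the use of strings for variable and lock names are assumptions made here for exposition, not part of any particular tool.
import java.util.Map;
import java.util.Set;

// Minimal sketch of lockset refinement: the candidate lock set of each
// shared variable starts as the set of all locks and shrinks to the
// intersection of the locks held at every access to that variable.
class LocksetChecker {
    private final Map<String, Set<String>> candidate; // candidate locks per variable

    LocksetChecker(Map<String, Set<String>> initialCandidates) {
        this.candidate = initialCandidates; // initially: all locks, for every variable
    }

    // called on every access to a shared variable; locksHeld is the set of
    // locks held by the accessing thread at the time of the access
    void onAccess(String variable, Set<String> locksHeld) {
        Set<String> c = candidate.get(variable);
        c.retainAll(locksHeld); // refine by intersection
        if (c.isEmpty()) {
            System.out.println("possible data race on " + variable); // no common guard remains
        }
    }
}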
The main idea behind a purely happens-before (PHB) technique is to monitor all the threads and their accesses to shared memory locations in the current execution trace and deduce a partial order, called happens-before, as imposed by the synchronization primitives. Formally, the happens-before relationship among program statements is defined as follows. Statement A happens before statement B (A < B) if any of the following holds: A executes before B in the same thread; A and B are operations on the same synchronization variable and are ordered between threads according to the properties of the synchronization objects they access (e.g., A releases a lock, and B subsequently acquires the same lock); or A < C and C < B for some statement C, since happens-before is transitive. This partial order is used thereafter to find accesses to the same memory location by two different statements not related by the happens-before relation. If at least one of these accesses is a write, a race is detected.
A specific mechanism to implement happens-before is the vector clock. A vector clock is essentially a mapping C: Tid -> Nat that assigns a natural-number clock value to each thread identification number (thread id). Some primitive operations on vector clocks are defined as follows: the happens-before comparison, C1 <= C2 iff C1(t) <= C2(t) for each thread id t; the join operation, (C1 join C2)(t) = max(C1(t), C2(t)) for each thread id t; the minimal vector clock, which maps every thread id to 0, and the minimal epoch 0@0; and the increment operation INCt(C), which sets C(t) = C(t) + 1 and leaves C(j) unchanged for every thread id j != t.
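These primitive operations may be sketched in code as follows; this is a minimal array-based representation assumed here for illustration (thread ids index the array), not an implementation from the patent.
// Illustrative array-based vector clock with the primitive operations
// defined above: happens-before comparison, join, and increment.
class VectorClock {
    final int[] c; // c[t] is the clock value of thread t

    VectorClock(int numThreads) {
        this.c = new int[numThreads]; // the minimal vector clock: all entries 0
    }

    // C1 happens before (or equals) C2 iff C1(t) <= C2(t) for each thread t
    static boolean happensBefore(VectorClock c1, VectorClock c2) {
        for (int t = 0; t < c1.c.length; t++)
            if (c1.c[t] > c2.c[t]) return false;
        return true;
    }

    // join: result(t) = max(C1(t), C2(t)) for each t; an O(n) operation
    static VectorClock join(VectorClock c1, VectorClock c2) {
        VectorClock r = new VectorClock(c1.c.length);
        for (int t = 0; t < c1.c.length; t++)
            r.c[t] = Math.max(c1.c[t], c2.c[t]);
        return r;
    }

    // INCt(C): increment only the entry of thread t
    void inc(int t) {
        c[t] = c[t] + 1;
    }
}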
During the execution of the program, a race detector needs to maintain multiple vector clocks, such as a vector clock per thread storing the clock value of every other thread, and vector clocks storing the last read and write by any thread for each shared variable, among others. As the program executes, these clocks are updated depending on the reads and writes to shared variables by different threads and on the synchronization operations of different threads. Vector clocks, if not used efficiently, lead to expensive O(n) operations and space overheads. Different tools vary in how they reduce these overheads of updating vector clocks by summarizing vector clock information into a scalar, thereby reducing O(n) operations to O(1) operations.
DJIT+ essentially maintains the following vector clocks. Firstly, each thread t keeps a vector clock Ct such that, for any thread u, Ct(u) records the clock of the last operation of thread u that happens before the current operation of thread t. The clock of every thread is incremented at each lock release operation. Secondly, Lm is a vector clock corresponding to each lock m; when a thread u releases a lock m, the DJIT+ algorithm updates Lm to Cu, and if a thread t subsequently acquires m, the algorithm updates Ct to be the join of Ct and Lm. Finally, to identify conflicting accesses, the DJIT+ algorithm keeps two vector clocks Rx and Wx that record, for every thread, the clock at which that thread last read and last wrote x, respectively.
Using these clocks, DJIT+ determines a read access to x by thread u to be race free if it happens after the last write by all threads (Wx <= Cu), and a write access to x by thread u to be race free provided it happens after all accesses (read and write) to that variable (i.e., Wx <= Cu and Rx <= Cu). Vector clocks are updated on synchronization operations that impose a happens-before order between different threads. DJIT+ uses the full generality of vector clocks and thereby incurs an overhead of O(n) in space as well as time, where n is the number of threads.
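In terms of the VectorClock sketch above, the two DJIT+ checks may be written as follows; the class and method names are again assumptions made here for illustration.
// Illustrative DJIT+ race checks. Cu is the vector clock of the accessing
// thread u; Rx and Wx record the last read and write of x by every thread.
// Both checks compare full vector clocks and are therefore O(n) per access.
class DjitChecks {
    // a read of x by u is race free if it happens after the last write by all threads
    static boolean readIsRaceFree(VectorClock Wx, VectorClock Cu) {
        return VectorClock.happensBefore(Wx, Cu);
    }

    // a write of x by u is race free if it happens after all reads and writes of x
    static boolean writeIsRaceFree(VectorClock Rx, VectorClock Wx, VectorClock Cu) {
        return VectorClock.happensBefore(Wx, Cu)
            && VectorClock.happensBefore(Rx, Cu);
    }
}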
FastTrack is a vector clock based dynamic data race detector that provides the same precision as DJIT+ but significantly improves the performance and space overheads of maintaining and updating multiple vector clocks. FastTrack works on the premise that the full generality of vector clocks, as used in DJIT+, is not required for detecting data races. Essentially, FastTrack switches effectively between a vector clock and an epoch (summary information from a vector clock folded into a scalar) in order to reduce expensive O(n) operations to O(1) operations as much as possible, taming the overheads of maintaining full vector clocks wherever possible for memory read and memory write operations.
In order to reduce the overheads, FastTrack keeps the summary of a vector clock in the form of an epoch. An epoch is denoted c@t. An epoch c@t happens before a vector clock V iff c <= V(t). For each variable x, FastTrack maintains a write epoch, which is essentially the clock value of the last thread that wrote x, and for reads it adaptively switches between a read epoch (the clock value of the last thread that read x) and a completely general vector clock. A shrewd observation exploited by FastTrack to improve over DJIT+ is that a write epoch suffices instead of a write vector clock, because writes to a variable are actually totally ordered. FastTrack also observes that in a race-free program, upon a write, all previous reads must happen before the write, so FastTrack adaptively switches from a read epoch to a read vector clock, and back again, whenever necessary. For example, it switches from an epoch to a vector clock when it has to distinguish between multiple concurrent reads, since they all potentially race with a subsequent write. When reads are ordered by the happens-before relation, FastTrack uses an epoch for the last read; otherwise, it uses a vector clock for reads.
On each read access by a thread, FastTrack simply checks that the read happens after the last write by comparing with the write epoch of the variable; this is a fast O(1) operation, as compared to the O(n) operation of DJIT+. On each write access by a thread, FastTrack first checks for conflicts with an earlier write by comparing with the write epoch of x, which is an O(1) operation and is not expensive from a performance or space point of view. However, in order to check for a read-write race, FastTrack also compares with the read vector clock to detect whether there is a race with any of the reads happening before this write, which is in general a relatively slow O(n) operation. FastTrack is nevertheless able to avoid keeping a full vector clock for fully ordered reads, such as reads of thread-local and lock-protected data. In other cases, where reads are not completely ordered, FastTrack adaptively switches to a vector clock for read operations. Thus, FastTrack reduces the general O(n) space and time overheads of vector clocks, as in DJIT+, to O(1) completely for memory reads and partially for memory writes. However, it makes no attempt to reduce the O(n) overheads of synchronization operations, which happen, for example, during acquire and release, and which are on the rise because of the larger number of threads in upcoming multi-cores and GPGPUs.
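The epoch representation and the O(1) checks may be sketched as follows, building on the VectorClock sketch above; the adaptive switch to a read vector clock and the O(n) read-write check are elided, and the names used are assumptions for illustration.
// Illustrative epoch c@t: a scalar clock value c paired with a thread id t.
class Epoch {
    final int clock; // c
    final int tid;   // t
    Epoch(int clock, int tid) { this.clock = clock; this.tid = tid; }

    // an epoch c@t happens before a vector clock V iff c <= V(t)
    boolean happensBefore(VectorClock v) {
        return clock <= v.c[tid];
    }
}

class FastTrackChecks {
    // O(1) read check: the read is race free iff it happens after the last write
    static boolean readIsRaceFree(Epoch writeEpoch, VectorClock Cu) {
        return writeEpoch.happensBefore(Cu);
    }

    // O(1) part of the write check: no conflict with the previous write; the
    // O(n) comparison against an unordered read vector clock is elided here
    static boolean writeIsRaceFreeWithLastWrite(Epoch writeEpoch, VectorClock Cu) {
        return writeEpoch.happensBefore(Cu);
    }
}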
FIG. 1 illustrates an example of the functioning of the FastTrack dynamic data race detector. Consider three threads executing concurrently and accessing the variables x and y, with the initial vector clock values <1,0,0>, <0,1,0>, and <0,0,1>. The reads on x by the three threads are not ordered by happens-before, so both FastTrack and DJIT+ store all three reads in a vector clock.
When the release operation on lock m is performed, the vector clock of T3 is copied to Lm; in this case <0,0,1> is copied to Lm (this operation takes O(n) time), and then T3 increments its clock to <0,0,2>. When T1 acquires Lm, it performs a join operation with the vector clock of Lm, so its new vector clock is <1,0,1>. When the write to x is observed in the trace at T1, FastTrack first checks for write-write races and then for read-write races. In this case there is no write-write race, since there is no previous write to x, so none is reported. But there is a read-write race, since the read of x on T2 is not ordered by happens-before with the write of x on T1. So FastTrack checks whether Rx happens before Wx at T1; since they are not ordered by happens-before, FastTrack reports a race. Now consider the accesses to the shared variable y. When the write access by T3 happens, FastTrack updates the write epoch of y to 1@3 (there is no race, since there is no access before this operation). The next access to y is a write access by T1, but Wy at T3 happens before Wy at T1, since the synchronization operation rel(Lm) at T3 is followed by acq(Lm) at T1. So the write epoch of y gets updated to 1@1. If the next access is a read access by T3, then there is a write-read race, i.e., Ry at T3 is not ordered by happens-before with Wy at T1, and FastTrack reports this race (it compares the write epoch 1@1 of y with T3's clock entry 0 for thread 1). Suppose Wy at T2 occurs after Wy at T1 in the trace; then there is a write-write race and FastTrack reports it (it compares the write epoch 1@1 of y with T2's clock entry 0 for thread 1).
FastTrack claims significant performance and space improvements over the DJIT+ algorithm and represents the state of the art in vector clock based dynamic data race detection. Though FastTrack improves the time and space overheads for memory operations (completely for write operations and partially for read operations), the synchronization operations, such as acquire/release, for maintaining and updating vector clocks still have considerable O(n) overheads. These overheads will only increase as more and more threads contend over shared data in multi-cores and GPGPUs.
The principal object of embodiments herein is to reduce overheads of maintaining and updating vector clocks during synchronization for dynamic data race detection.
Another object of embodiments herein is to orthogonally improve the performance of vector clock based dynamic data race detection over state-of-the-art techniques, without affecting the precision of dynamic data race detection, in the way the vector clocks for synchronization operations are maintained and updated.
Accordingly, embodiments herein provide a method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system. The method comprises opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
One embodiment herein provides a method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock, when said second thread is acquiring said lock from said first thread, by updating the entire vector of clock values in said second thread with the corresponding maximum clock value for each thread, where said maximum clock value for each thread is obtained by comparing the clock value for each thread in said lock. The method is characterized by: maintaining a previous version value in each of said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread; maintaining a previous version value in each lock, where said previous version is the previous version of the thread that last released said lock; checking for a condition that the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread; and, when the previous version value of said first thread is not more than the version value of said first thread in the version vector of said second thread, updating the clock value of said first thread in said second thread and retaining the clock values of threads other than said first thread without updating them.
Embodiments herein also disclose a system for performing the various methods disclosed herein.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
This invention is illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
FIG. 1 illustrates an example of the functioning of the FastTrack dynamic data race detector, in the context of the invention,
FIG. 2 illustrates the handling of synchronization operations in dynamic data race detection according to the prior art,
FIG. 3 illustrates the handling of synchronization operations in dynamic data race detection according to embodiments disclosed herein,
FIG. 4 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to one embodiment,
FIG. 5 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to another embodiment,
FIG. 6 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to yet another embodiment,
FIG. 7 illustrates a thread interaction scenario associated with maintaining and updating vector clocks for synchronization operations, according to a further embodiment, and
FIG. 8 and FIG. 9 illustrate an example computing environment that may be used in implementing the embodiments disclosed herein.
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The embodiments herein enable a method and system to reduce the overheads of maintaining and updating vector clocks during synchronization by opportunistically reducing the complexity of synchronization operations from O(n), in time and space, to O(1). Referring now to the drawings, and more particularly to FIGS. 1 through 9, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
Embodiments herein enable opportunistic reduction of the O(n) time and space overheads of synchronization, exploiting the fact that there is temporal locality in thread interactions. Essentially, threads tend to interact locally with each other over time, and interaction is not completely haphazard in nature. If a thread TX has held lock Li k times, where k >= 1, before TY acquires Li, and there is no thread TZ, where z != x and z != y, which acquires Li between the successive releases by TX and the final acquire by TY, and the last update of TY was received from TX, then the expensive O(n) join operation can be converted to an O(1) join operation (for all but the first join), where n is the number of threads.
Embodiments herein achieve the opportunistic reduction of complexity of join operations. The improvement is illustrated through the use of data structure of ThreadState and LockState representing state of a thread and state of a lock used by various threads respectively, according to an example implementation of a preferred embodiment showing improvement over an example implementation of FastTrack.
Structure for ThreadState and LockState as used by FastTrack:
class ThreadState {
    int tid;       // thread id
    int C[];       // vector of the clocks of all threads, maintained by each thread
    int epoch;     // clock value of tid (c@tid)
    int Version[]; // last version value of each thread at the time of the join with that thread
    int Vepoch;    // same as the version value of tid (Version[tid])
    int Pversion;
}
class LockState {
    int C[];       // copy of the vector clock of the last thread that released the lock
    int Vepoch;    // same as the version value v of tid (v@tid), where tid is the thread that last released the lock
    int Pversion;
}
Improved structure for ThreadState and LockState:
class ThreadState {
    int tid;       // thread id for this thread
    int C[];       // clocks of all threads, maintained by each thread as its vector clock
    int epoch;     // clock value of tid (c@tid)
    int Version[]; // version of each thread at the time of the last join
    int Vepoch;    // current version value of this thread (Version[tid])
    int Pversion;  // version number of this thread after which there is no change
                   // in the clock of any other thread except this tid
}
class LockState {
    int C[];       // vector clock copy of the last thread that released the lock
    int Vepoch;    // version value of tid, the last thread that released the lock
    int Pversion;  // Pversion of the last thread that released the lock
}
Some of the notations are explained further as follows:
- Version is a scalar maintained per thread; it is incremented every time there is a change in any element of the vector clock maintained by the thread.
- Versiont[1::n] is the version vector of thread t, where each element Versiont[u] is the latest version received by thread t from the thread u that it joins with.
- Vepoch (version epoch v@t) is the current version v of thread t, i.e., Versiont[t], in the given vector clock.
- L.Vepoch, maintained for each lock L, is the same as the version value of tid, the thread that last released the lock.
- Pversioni (previous version of Ti) denotes the version number of Ti after which there is no change in the clock of any other thread except that of Ti in the vector clock maintained by Ti.
The improved ThreadState and LockState introduce the variable Pversion, representing the previous version of a thread; together with the fields above, this is the metadata each thread maintains.
- Let Li be a lock and Ti be the thread that last released Li. The corresponding vector clock (C), Vepoch, Pversion, and thread id recorded in Li are denoted Li.C, Li.Vepoch, Li.Pversion, and Li.Tid, and are the same as the corresponding values of Ti at the time of the release of Li by Ti.
The improvement brought about by the embodiments herein may be stated through the following Lemma:
- Let T1 and T2 be two threads such that T1 releases a lock L1 at time t1, which is next acquired by T2, and T1 releases a lock L2 at time t2, which is next acquired by T2, where t2 > t1. If Pversion1 does not change between t1 and t2, then the O(n) acquire operation by T2 at t2 can be reduced to an O(1) acquire operation.
The implementation of the aforementioned Lemma is described through the following illustration and subsequent examples:
When T1 releases lock L1, it takes O(n) time to copy C1 to L1.C. This is followed by an increment of C1(1); i.e., the clock value of T1, as maintained by T1 in its vector C1, is incremented. Further, Pversion1 is copied to L1.Pversion.
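In terms of the ThreadState and LockState structures above, the release step may be sketched as follows. This is an illustrative sketch only; the advance of the releasing thread's version epoch is inferred from Examples 2 and 3 below, and the function name release is an assumption made here.
// Illustrative release of lock m by thread t, following the description above.
void release(ThreadState t, LockState m) {
    for (int u = 0; u < t.C.length; u++) // O(n) copy of t's vector clock into the lock
        m.C[u] = t.C[u];
    m.Vepoch = t.Vepoch;                 // version epoch of the releasing thread
    m.Pversion = t.Pversion;             // Pversion of the releasing thread
    t.C[t.tid] = t.C[t.tid] + 1;         // increment t's own clock entry
    t.Vepoch = t.Vepoch + 1;             // advance t's version epoch (inferred from the examples)
    t.Version[t.tid] = t.Vepoch;         // keep t's own version vector entry in sync
}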
Subsequently, when T2 acquires the lock L1 from T1, FastTrack does an expensive O(n) join operation, guarded by the check L1.C(1) > C2(1). Embodiments herein, however, avoid a redundant O(n) join operation by checking whether L1.Pversion <= Version2(1), i.e., whether the version value of thread T1 after which there are no updates from any other thread is not more than the version value of thread T1 in the version vector of thread T2.
If the condition is true, that is, if the current Pversion value of thread T1 is not more than the version value of thread T1 in the vector of thread T2, it means that thread T1 did not acquire lock L1, nor perform any other synchronization operation such as a join, since its last update, and that the last update of thread T1 was already received by thread T2.
If the condition is false, that is, if the current Pversion value of thread T1 is more than the version value of thread T1 in the vector of thread T2, it means that there was another join operation involving thread T1 since the last join operation between threads T1 and T2, and that the last update to thread T1 was not received by thread T2.
If the condition is true, there is no need to update the entire clock vector of thread T2, as only the entry of thread T1 has changed since the last update to thread T2. Therefore, in the improved method, the check is followed by updating Pversion2, C2(1), and Version2(1) to T2.Vepoch, L1.C(1), and L1.Vepoch, respectively.
However, if the condition is false, the O(n) operation as performed by FastTrack is adopted to update thread T2.
EXAMPLE IMPLEMENTATION
The synchronization operation of obtaining a lock by thread T2 from thread T1 as performed by FastTrack may be illustrated using the following pseudo code:
void join(ThreadState t, LockState m) {
    // O(n) operation to update all thread clock values
    t.C[u] = max(t.C[u], m.C[u]) for all u;
}
As illustrated in FIG. 2, FastTrack always performs the synchronization operation with O(n) complexity.
The improved synchronization operation of obtaining a lock by thread T2 from thread T1 according to embodiments herein may be illustrated using the following pseudo code:
void join(ThreadState t, LockState m) {
    if (m.C[u] > t.C[u] for any u) {            // the lock carries newer clock values
        t.Vepoch = t.Vepoch + 1;                // increment the current version of t
        t.Pversion = t.Vepoch;                  // update Pversion
        if (m.Pversion <= t.Version[u]) {       // check for a redundant join, where u is m.tid
            t.Version[u] = m.Vepoch;
            t.C[u] = m.C[u];                    // the clock entry of u is the only one updated in t
            return;                             // avoid the O(n) path below and return
        }
        t.C[u] = max(t.C[u], m.C[u]) for all u; // expensive O(n) path
    }
}
As illustrated in FIG. 3, the improved method performs a check to reduce the complexity of the synchronization operation to O(1).
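The effect of the fast path may be traced on a small scenario using the join pseudo code above and the illustrative release sketch given earlier; the states t1, t2, and l1 (the ThreadStates of threads T1 and T2 and the LockState of lock L1) are assumptions made here for illustration.
// T1 releases L1 twice with no intervening synchronization by T1;
// T2 acquires L1 after each release.
release(t1, l1); // O(n): l1 snapshots t1's clock, Vepoch, and Pversion
join(t2, l1);    // first interaction: the full O(n) join runs, and
                 // t2.Version[1] now holds T1's version at the release
release(t1, l1); // t1.Pversion is unchanged: only T1's own clock and version moved
join(t2, l1);    // l1.Pversion <= t2.Version[1], so the redundant-join check
                 // succeeds and only t2.C[1] is updated: an O(1) operation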
The improved method of performing synchronization may be illustrated further using the following examples:
EXAMPLE 1: In example 1, threads T1, T2, and T3 interact as depicted in FIG. 4. When T3 acquires Lx from T1, the vector clock, the version vector, the Pversion, and the version epoch of T3 are updated, which essentially takes O(n) time, where n is the number of threads. Similarly, when T3 acquires Lz from T2, the vector clock, the version vector, the Pversion, and the version epoch of T3 are updated. There are no operations changing the vector clock of T1 between the releases of Lx and Ly, except for the clock and current version of T1 itself. So when T3 acquires Ly, embodiments herein do an O(1) check, since the Pversion of T1 is <= the version of T1 recorded in T3, and increment T3.Vepoch. This is followed by updates of C3(1), Pversion3, and Ver3(1) to the clock of T1, T3.Vepoch, and the current version of T1 when T1 released Ly, respectively. Similarly, when T3 acquires Lw, an O(1) operation is performed. This is in contrast to the FastTrack method, which performs an O(n) join for the last two acquires as well.
EXAMPLE 2: In example 2, consider that threads T1, T2, and T3, out of many active threads, interact as depicted in FIG. 5. When T2 acquires Lx from T1, the vector clock, the version vector, the Pversion, and the version epoch of T2 are updated. This takes O(n) time, where n is the number of threads. Similarly, when T3 acquires Ly from T1, the vector clock, the version vector, the Pversion, and the version epoch of T3 are updated. Because there are no acquire operations that change the vector clock of T1 between Rel(Lx) and Rel(Lz), only the clock and version epoch of T1 are modified. Thereafter, when T3 acquires Lz, the present invention performs a simple check to see whether the Pversion of T1 at the time of the release of Lz is <= Version3(1) (an O(1) check) and increments T3.Vepoch. It then updates C3(1), Pversion3, and Version3(1) to the clock of T1, T3.Vepoch, and the version epoch of T1 at the time of the release of Lz, respectively. Similarly, when T2 acquires La, the present invention performs an O(1) operation, thereby reducing some O(n) operations to O(1), where n is the number of threads.
EXAMPLE 3: In example 3, consider four threads (T1, T2, T3, and T4) out of many active threads, which interact as depicted in FIG. 6. Suppose the threads T2 and T3 are in separate loops; then one possible interleaving between T2 and T3 is as follows. T2 acquires Lx and releases Lx, followed by the acquire of Lx and the release of Lx by T3. This is further followed by the acquire and release of Ly, where y != x, by T2 (assume that the initial acquire of Ly by T2 is a redundant join, O(1)) k times, and then the acquire of Ly by T3. In this scenario, when T3 first acquires Lx after it is first released by T2, it does an O(n) join operation.
Thereafter, all the subsequent consecutive acquires and releases of Ly by T2 only increment the clock and version epoch of T2. The next time T3 acquires Ly, the present invention checks whether the Pversion of T2 at the time of the release of Ly, i.e., Ly.Pversion, is <= Version3(2) (an O(1) check) and increments T3.Vepoch, followed by updating C3(2) and Version3(2) to the clock of T2 and the version epoch of T2 at the time of the release of Ly. This reduces O(n) operations to O(1). Similarly, in a scenario where T3 executes k times followed by the acquire by T2, the present method reduces O(n) overheads to O(1).
EXAMPLE 4: In example 4, consider four threads T1, T2, T3, and T4 out of many active threads, with interactions as depicted in FIG. 7. Initially, threads T2 and T3 interact, followed by the interaction between T2 and T1, and then the interaction between T3 and T4.
Consider the interaction between T2 and T3: when the first release and acquire on Lu is performed, the join operation that takes place at T3 takes O(n) time. The next time, when T3 acquires Lv, the present method checks whether the Pversion of T2 at the time of the release of Lv is <= Version3(2) (an O(1) check) and increments the current version of T3, followed by updates of C3(2) and Version3(2) to the clock of T2 and the version epoch of T2 at the time of the release of Lv. This reduces O(n) operations to O(1). Similarly, the first release and acquire between T3 and T2 takes O(n) time, and the subsequent releases and acquires between T3 and T2 take O(1) time. Similarly, O(n) operations among T1, T2, T3, and T4 are reduced. Thus, embodiments herein reduce many O(n) operations to O(1) operations.
FIG. 8 illustrates a computing environment implementing the application as disclosed in an embodiment herein. As depicted, the computing environment comprises at least one processing unit that is equipped with a control unit and an Arithmetic Logic Unit (ALU), a memory, a storage unit, a plurality of networking devices, and a plurality of input/output (I/O) devices. The processing unit is responsible for processing the instructions of the algorithm. The processing unit receives commands from the control unit in order to perform its processing. Further, any logical and arithmetic operations involved in the execution of the instructions are computed with the help of the ALU. The processing unit can support more than one thread.
FIG. 9 illustrates another computing environment implementing the application as disclosed in an embodiment herein. As depicted, the computing environment comprises more than one processing unit, each equipped with a control unit, an array of Arithmetic Logic Units (ALUs), and a multilevel local memory (cache hierarchy). Additionally, the computing environment has a storage unit, a plurality of networking devices, and a plurality of input/output (I/O) devices. The processing units in this case can be the same, similar, or widely different in their capabilities and can support a plurality of threads. The overall computing environment can be composed of multiple homogeneous and/or heterogeneous cores, multiple GPUs of different kinds, and special media and other accelerators. Each processing unit is responsible for processing the instructions of the algorithm. The processing unit receives commands from the control unit in order to perform its processing. Further, any logical and arithmetic operations involved in the execution of the instructions are computed with the help of the ALUs. Further, the plurality of processing units may be located on a single chip or over multiple chips.
The instructions and code required for the implementation are stored in the memory unit, the storage unit, or both. At the time of execution, the instructions may be fetched from the corresponding memory and/or storage and executed by the processing unit.
In the case of hardware implementations, various networking devices or external I/O devices may be connected to the computing environment to support the implementation through the networking unit and the I/O device unit.
In some embodiments, the methods disclosed herein may be implemented as part of a thread library. In some embodiments, the methods disclosed herein may be implemented as part of a runtime system, such as a Just-In-Time compilation system. In some embodiments, the methods disclosed herein may be implemented as part of an Operating System (OS).
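As one illustration of the thread-library embodiment, the detector updates can piggyback on the native mutex operations. The following shim reuses the Thread, Lock, on_acquire, and on_release sketch given earlier; InstrumentedMutex is an invented name for this sketch, not the API of any particular library.

```cpp
#include <mutex>

// Hypothetical thread-library shim: every lock/unlock also drives the
// vector-clock bookkeeping, so user code needs no separate detector calls.
class InstrumentedMutex {
    std::mutex m_;
    Lock meta_;                        // detector state from the earlier sketch
public:
    explicit InstrumentedMutex(size_t n) { meta_.C.assign(n, 0); }
    void lock(Thread& self) {
        m_.lock();
        on_acquire(self, meta_);       // happens-before edge from the last releaser
    }
    void unlock(Thread& self) {
        on_release(self, meta_);       // publish state while still holding the mutex
        m_.unlock();
    }
};
```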
In some embodiments, the methods disclosed herein may be made use of by a hardware system with a specific instruction set architecture. Such a hardware system may use specific registers for storing the state information of threads. The state information may be stored in register memory or in external system memory.
In some embodiments, the methods disclosed herein may be implemented in a multi-thread embedded system environment.
In various embodiments, the methods for reducing overheads during synchronization operations may further be enhanced for certain systems by sampling thread interactions. The embodiments disclosed herein monitor all thread interactions; however, as the number of threads and thread interactions grows, there may be a need to sample thread interactions to reduce overheads. Further, sampling of thread interactions may be implemented in systems that have severe memory usage restrictions at runtime.
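No particular sampling strategy is prescribed here; purely as an assumption, one simple policy is to monitor every k-th synchronization event, as in the following sketch.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical fixed-rate sampler: one synchronization event in every
// `period` events is selected for full monitoring.
class InteractionSampler {
    std::atomic<uint64_t> count_{0};
    const uint64_t period_;            // must be >= 1
public:
    explicit InteractionSampler(uint64_t period) : period_(period) {}
    bool should_monitor() {
        return count_.fetch_add(1, std::memory_order_relaxed) % period_ == 0;
    }
};
```

An integration could consult should_monitor() before performing the detector updates on an acquire; note that skipping happens-before edges trades detection precision for lower overhead, which is why sampling is presented as an option for constrained systems rather than as the default.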
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device. Therefore, it is understood that the scope of the protection extends to such a program and, in addition, to a computer readable means having a message therein, where such computer readable storage means contain program code means for the implementation of one or more steps of the method when the program runs on a server, a mobile device, or any suitable programmable device. The method is implemented in a preferred embodiment through or together with a software program written in, e.g., Very high speed integrated circuit Hardware Description Language (VHDL) or another programming language, or implemented by one or more VHDL modules or several software modules being executed on at least one hardware device. The hardware device can be any kind of portable device that can be programmed. The device may also include means which could be, e.g., hardware means such as an ASIC, or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. The method embodiments described herein could be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.

Claims (15)

  1. A method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system, said method comprising
    opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
  2. The method as in claim 1, wherein said method opportunistically reduces complexity of said synchronization operation from O(n) to O(1) wherein n represents the number of threads being monitored.
  3. A method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock when said second thread is acquiring said lock from said first thread, by updating entire vector of clock values in said second thread with corresponding maximum clock value for each thread where said maximum clock value for each thread is obtained by comparing clock value for each thread in said lock, said method characterized by
    maintaining previous version value in each among said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread;
    maintaining previous version value in each lock, where said previous version is the version of a thread that last released said lock;
    checking for a condition, if previous version value of said first thread is not more than version value of said first thread in version vector of said second thread; and
    when previous version value of said first thread is not more than version value of said first thread in version vector of said second thread,
    updating the clock value of said first thread to said second thread and
    retaining clock values of threads other than said first thread without updating.
  4. The method as in claim 3, wherein said method opportunistically reduces complexity of said synchronization operation between said first thread and second thread from O(n) to O(1) wherein n represents the number of threads being monitored.
  5. The method as in claim 3, wherein said method comprises sampling thread interactions before checking for said condition to reduce overhead.
  6. A system for performing a method according to at least one of claims 1 to 5.
  7. The system as in claim 6, wherein said system is a single processor system.
  8. The system as in claim 6, wherein said system is a multi-processor system.
  9. The system as in claim 6, wherein said system is a homogeneous processor system.
  10. The system as in claim 6, wherein said system is a heterogeneous processor system.
  11. A computer program product embodied in a computer readable medium including program instructions which when executed by a processor cause the processor to perform a method for reducing overheads orthogonally during synchronization of threads in a vector clock based dynamic data race detection system, said method comprising
    opportunistically reducing the complexity of updating clock values during a thread synchronization operation.
  12. The computer program product as in claim 11, wherein said method opportunistically reduces complexity of said synchronization operation from O(n) to O(1) wherein n represents the number of threads being monitored.
  13. A computer program product embodied in a computer readable medium including program instructions which when executed by a processor cause the processor to perform a method for reducing overheads orthogonally during synchronization in a vector clock based dynamic data race detector between a first thread and a second thread using a lock when said second thread is acquiring said lock from said first thread, by updating entire vector of clock values in said second thread with corresponding maximum clock value for each thread where said maximum clock value for each thread is obtained by comparing clock value for each thread in said lock, said method characterized by
    maintaining previous version value in each among said threads being monitored, where said previous version of a thread among said threads being monitored is a version after which there are no updates from any thread other than said thread;
    maintaining previous version value in each lock, where said previous version is the version of a thread that last released said lock;
    checking for a condition, if previous version value of said first thread is not more than version value of said first thread in version vector of said second thread; and
    when previous version value of said first thread is not more than version value of said first thread in version vector of said second thread,
    updating the clock value of said first thread to said second thread and
    retaining clock values of threads other than said first thread without updating.
  14. The computer program product as in claim 13, wherein said method opportunistically reduces complexity of said synchronization operation between said first thread and second thread from O(n) to O(1) wherein n represents the number of threads being monitored.
  15. The computer program product as in claim 13, wherein said method comprises sampling thread interactions before checking for said condition to reduce overhead.
PCT/KR2012/001880, priority date 2011-03-15, filed 2012-03-15: Method and system for maintaining vector clocks during synchronization for data race detection, WO2012124995A2 (en)

Applications Claiming Priority (2)

IN793/CHE/2011, priority date 2011-03-15
IN793CH2011, priority date 2011-03-15

Publications (2)

WO2012124995A2, published 2012-09-20
WO2012124995A3, published 2012-12-27

Family ID: 46831224

Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 12758344; country of ref document: EP; kind code of ref document: A2.

NENP: non-entry into the national phase. Ref country code: DE.

122 EP: PCT application non-entry in the European phase. Ref document number: 12758344; country of ref document: EP; kind code of ref document: A2.