
Domain-Specific Language for Checkpointing

Language motivation and goals

Domain-Specific Languages (DSLs) are specialized languages written for a particular application domain. Within that domain, a DSL is more expressive than a General-Purpose Language (GPL), but unlike a GPL it has limited features and applicability. Because DSLs are specialized and expressive, they are easy to learn and use. Using a DSL increases productivity and reduces software development time and cost.

Checkpointing is a technique for adding fault tolerance to applications. In the checkpointing process, the application image is stored persistently so that the application can be stopped, restored, and restarted with no noticeable difference in its execution compared with an uninterrupted run. In distributed and heterogeneous environments like the Grid, resource availability changes dynamically, and resources may suffer network or system failures. It therefore helps to have a mechanism for migrating jobs to other available nodes and restarting them from where they stopped on the previous resources. This not only makes the application reliable and fault-tolerant but also helps with gang scheduling and avoids wasting time and resources (by restarting the application from the stored images instead of from scratch). All of these benefits also apply to parallel programs running on heterogeneous clusters. The state of a parallel program depends on program variables, process states, and the state of the interconnect. When checkpointing is embedded in such programs, no messages may be lost or duplicated during the restore phase. Checkpointing should ensure consistent application results without any drastic reduction in performance.

Checkpointing techniques fall into three main categories: kernel-level, user-level, and application-level. Kernel-level checkpointing entails periodic core dumps of the machine state; it is operating-system dependent and lacks portability. In user-level checkpointing, system calls are intercepted to keep track of the program state, and the operating system remains unaware of the Checkpointing and Restart (CaR) mechanism. In Application-Level Checkpointing (ALC), the CaR mechanism is inserted directly into the application.

This research falls under ALC: the user is responsible for selecting the application variables whose image needs to be created and for inserting the checkpointing code into the application. This approach therefore gives the user more power for selective checkpointing and helps overcome the inconsistencies introduced by different operating systems in a heterogeneous environment. With ALC, the memory used and the data written to the restart files are smaller than with the other approaches.
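To make the idea of selective, application-level checkpointing concrete, here is a minimal Python sketch. It is not the code produced by this project: the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the stop_after parameter that simulates a failure are all assumptions made for illustration. Only the two variables the "user" selected (step and total) are saved, and a restarted run resumes from the last stored image.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the selected variables to `path` via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Rename over the old image so a crash never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, frequency, stop_after=None):
    """Sum 0..total_steps-1, checkpointing every `frequency` iterations.

    `stop_after` (hypothetical) simulates a failure after that many
    iterations of the current run.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    step, total = state["step"], state["total"]
    executed = 0
    while step < total_steps:
        if stop_after is not None and executed == stop_after:
            return None  # simulated crash: work since the last checkpoint is lost
        total += step
        step += 1
        executed += 1
        if step % frequency == 0:
            save_checkpoint(path, {"step": step, "total": total})
    return total
```

Calling run once with stop_after set and then again without it yields the same result as a single uninterrupted run, which is the consistency property described above. The checkpoint frequency trades lost work against extra write operations.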

During the course of this research, it was observed that the algorithm and concept of ALC remain consistent across applications, but details such as the APIs used, the frequency of checkpointing, and the code section to be checkpointed vary from application to application. Application-level CaR is implemented according to a recurring pattern. The concept of checkpointing and its usage pattern can therefore be abstracted into high-level language constructs to promote reusability, code correctness, and expressiveness. Apart from separating the commonalities and variations associated with ALC, a high-level language can also address the coupling between the problem space and the solution space. As the solution space for ALC evolves, more and more off-the-shelf APIs and tools will become available. However, the problem space and the checkpointing specifications (the places in the application where the checkpointing or restart mechanism is required, the application variables that need to be checkpointed, and the frequency of checkpointing) might not change for an application. A high-level language is therefore needed to separate the checkpointing specifications from their implementation, so that the solution space and the problem space are decoupled.

Another issue that a high-level language can address is the invasive reengineering of large legacy applications to embed the checkpointing logic into them. Invasive reengineering is challenging because checkpointing is cross-cutting in nature, i.e., it is spread across multiple modules. Because checkpointing involves extra read and write operations, the checkpointed application may in some cases take conspicuously longer to run than the non-checkpointed application. When performance is more critical than fault tolerance, it can be useful for the stakeholder to be able to turn off the checkpointing feature. For ease of code maintenance and evolution, it is also important to avoid creating multiple copies of the application (one with checkpointing and one without). Compared with using a library for ALC, a DSL separates the checkpointing logic from the original program and simplifies software maintenance. In short, the existing application should not undergo any intrusive change, and checkpointing should exist as a pluggable feature. These issues were the main motivation for developing a DSL for ALC.

This DSL is language independent and can be used to checkpoint applications written in any programming language. The user only has to provide the checkpointing specifications through the DSL. The code responsible for CaR is generated automatically from these specifications and is non-intrusively woven into the source code of the existing application. Because the user identifies the places in the application where the CaR mechanism is required, this is a semi-automatic approach to ALC. The benefits of semi-automatic code generation are the fine-grained control and the selectivity it offers the user when the CaR code is generated. The DSL reduces the end-user effort of implementing ALC in legacy applications, and with it the time and cost of inserting the ALC mechanism into large and complex legacy applications.
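The weaving step can be pictured with a toy sketch in Python. This is an illustration under stated assumptions, not the project's generator: the weave function, its regular-expression matching on single lines, and the generated statement save_int(i); are all hypothetical. It covers only "before" and "after" hooks and leaves out the "around" hook type.

```python
import re


def weave(source, pattern, hook_type, generated):
    """Insert `generated` before or after every line matching `pattern`.

    Returns a new string; the original source text is not modified.
    """
    if hook_type not in ("before", "after"):
        raise ValueError("hook_type must be 'before' or 'after'")
    matcher = re.compile(pattern)
    woven = []
    for line in source.splitlines():
        # Reuse the matched line's indentation for the generated statement.
        indent = line[: len(line) - len(line.lstrip())]
        matched = matcher.search(line) is not None
        if matched and hook_type == "before":
            woven.append(indent + generated)
        woven.append(line)
        if matched and hook_type == "after":
            woven.append(indent + generated)
    return "\n".join(woven)


# A C-like fragment of an existing application (illustrative only).
original = "for (i = 0; i < n; i++) {\n    update(grid, i);\n}"
print(weave(original, r"update\(", "after", "save_int(i);"))
```

The original text stays untouched and the woven version is produced separately, so checkpointing can be switched off by building from the unwoven source. A line-based regular expression is a simplification; a real weaver would match program elements rather than raw text.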

The process of code generation mentioned above can be best understood and developed using the Generative Programming (GP) techniques and tools. The GP implementation technologies and tools that were used in this research are Aspect-Oriented Programming (AOP) and program transformation techniques.

Language description

The main features of the DSL (expressed using Feature-Oriented Domain Analysis) are:

ChckptgPack: one-of(Checkpoint, Restart)

Checkpoint: all (CheckPointCondition, CheckPointCode)

CheckPointCondition: all (Hook, Pattern, Frequency, loopVar?)

CheckPointCode: all(SaveVarType, saveVarArg)

SaveVarType: one-of (SaveInt, SaveDouble, SaveChar, SaveCharArray1D, SaveCharArray2D, SaveIntArray1D, SaveDoubleArray1D, SaveIntArray2D, SaveDoubleArray2D)

Restart: all (RestartCondition, RestartCode)

RestartCondition: all(Hook, pattern)

RestartCode: all(ReadVarType, restartVarArg)

ReadVarType: one-of (ReadIntVarFromFile, ReadDoubleVarFromFile, ReadCharVarFromFile, ReadIntArray1DFromFile, ReadIntArray2DFromFile, ReadDoubleArray1DFromFile, ReadDoubleArray2DFromFile, ReadCharArray1DFromFile, ReadCharArray2DFromFile)

Hook: all (HookType, HookElement)

HookType: one-of (afterHookType, beforeHookType, aroundHookType)

HookElement: one-of(Call, Execution, Statement)
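One way to read the feature model above is as a set of data types: each "one-of" becomes an enumeration (or a union) and each "all" becomes a record with one field per sub-feature. The Python sketch below encodes it that way. The feature names come from the model; the Python field names, the field types (str for patterns and variable arguments, int for frequency), and the positive-frequency check are assumptions added for illustration and are not part of the DSL definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union

HookType = Enum("HookType", ["afterHookType", "beforeHookType", "aroundHookType"])
HookElement = Enum("HookElement", ["Call", "Execution", "Statement"])
SaveVarType = Enum("SaveVarType", [
    "SaveInt", "SaveDouble", "SaveChar", "SaveCharArray1D", "SaveCharArray2D",
    "SaveIntArray1D", "SaveDoubleArray1D", "SaveIntArray2D", "SaveDoubleArray2D",
])
ReadVarType = Enum("ReadVarType", [
    "ReadIntVarFromFile", "ReadDoubleVarFromFile", "ReadCharVarFromFile",
    "ReadIntArray1DFromFile", "ReadIntArray2DFromFile",
    "ReadDoubleArray1DFromFile", "ReadDoubleArray2DFromFile",
    "ReadCharArray1DFromFile", "ReadCharArray2DFromFile",
])


@dataclass(frozen=True)
class Hook:
    hook_type: HookType
    element: HookElement


@dataclass(frozen=True)
class CheckPointCondition:
    hook: Hook
    pattern: str
    frequency: int
    loop_var: Optional[str] = None  # "loopVar?" is the only optional feature

    def __post_init__(self):
        # Assumed constraint: checkpoint every N >= 1 occurrences.
        if self.frequency < 1:
            raise ValueError("frequency must be a positive integer")


@dataclass(frozen=True)
class CheckPointCode:
    save_var_type: SaveVarType
    save_var_arg: str


@dataclass(frozen=True)
class Checkpoint:
    condition: CheckPointCondition
    code: CheckPointCode


@dataclass(frozen=True)
class RestartCondition:
    hook: Hook
    pattern: str


@dataclass(frozen=True)
class RestartCode:
    read_var_type: ReadVarType
    restart_var_arg: str


@dataclass(frozen=True)
class Restart:
    condition: RestartCondition
    code: RestartCode


# ChckptgPack: one-of(Checkpoint, Restart)
ChckptgPack = Union[Checkpoint, Restart]

# Example: save the 1-D integer array "grid" after every 10th call to "update".
spec = Checkpoint(
    CheckPointCondition(Hook(HookType.afterHookType, HookElement.Call), "update", 10),
    CheckPointCode(SaveVarType.SaveIntArray1D, "grid"),
)
```

With this encoding, a specification that omits a mandatory feature fails at construction time, and loopVar is the only feature that can be left out.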

Example of DSL code

The basic structure of the DSL code and the format for providing the specifications are shown below. The structural elements are generated automatically, and the placeholders for the variant parts, which the user must provide, are marked by “< >”. The specifications for the restart mechanism (the RestartCondition and RestartCode features) should be provided in the code block following the beginInitialization keyword. The loopVar shown in the code below is an optional structural element. The features are specified through the GPL-neutral APIs provided in the DSL.

For checkpointing:

For restart:

Infrastructure Required

For writing and using the DSL:

For generating the checkpointing code from the DSL specifications:

People

  • Ritu Arora
  • Puri Bangalore
  • Marjan Mernik

Publications

  • Ritu Arora, Purushotham Bangalore, Marjan Mernik. A Technique for Non-Invasive Application-Level Checkpointing. The Journal of Supercomputing.
  • Ritu Arora, Purushotham Bangalore, Marjan Mernik. Developing Scientific Applications Using Generative Programming, Software Engineering for Computational Science and Engineering Workshop, International Conference on Software Engineering (ICSE 2009), Vancouver, Canada, May 17-23, 2009.
  • Ritu Arora, Marjan Mernik, Purushotham Bangalore, Suman Roychoudhury, and Saraswathi Mukkai. A Domain-Specific Language for Application-Level Checkpointing, accepted in the International Conference on Distributed Computing and Internet Technologies (ICDCIT 2008), New Delhi, India, December 10-13, 2008.

Poster

  • Ritu Arora. Raising The Level of Application-Level Checkpointing, in the ACM Student Research Competition at OOPSLA 2008, Nashville, Tennessee, October 19-23, 2008.