Cloud Computing: Grid Applications and Infrastructure

Example: RAMS (Regional Atmospheric Modeling System)

* Modeled a mesoscale convective complex that dropped heavy rain; results were in good agreement with recorded data
* Used 5 km spacing instead of the usual 10 km
* Ran on 256+ processors

HPC (High-Performance Computing): computation-intensive computing
HPC applications typically use a lot of CPU resources; they handle less data than data-intensive applications but are far more compute-intensive.

A grid makes it possible to run such programs without access to a supercomputer.

Such an application may consist of several jobs, some of which can run in parallel.
All jobs in an application are represented by a DAG (Directed Acyclic Graph).
Example: the output of Job0 serves as input to both Job1 and Job2; the outputs of Job1 and Job2 serve as input to Job3.
So Job1 and Job2 can be executed in parallel on different workstations.
The files passed between these jobs are generally several GB in size, but the jobs themselves are mostly compute-intensive.
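
The Job0 to Job3 example can be sketched in code. The following is a minimal illustration, not part of any grid toolkit: the function name `parallel_levels` and the dictionary layout are assumptions made for this sketch. It groups the jobs of a DAG into levels so that all jobs in one level can run in parallel once the previous level has finished.

```python
from collections import defaultdict


def parallel_levels(deps):
    """Group jobs into levels; jobs in the same level do not depend on each other.

    deps maps each job name to the list of jobs whose output it needs.
    Every job must appear as a key, even if it has no dependencies.
    """
    indegree = {job: len(parents) for job, parents in deps.items()}
    children = defaultdict(list)
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)

    # Jobs with no unmet dependencies can start immediately.
    ready = sorted(job for job, n in indegree.items() if n == 0)
    levels = []
    while ready:
        levels.append(ready)
        next_ready = []
        for job in ready:
            for child in children[job]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    if sum(len(level) for level in levels) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return levels


# The example from the notes: Job0 feeds Job1 and Job2, which both feed Job3.
example_dag = {
    "Job0": [],
    "Job1": ["Job0"],
    "Job2": ["Job0"],
    "Job3": ["Job1", "Job2"],
}
levels = parallel_levels(example_dag)
print(levels)  # [['Job0'], ['Job1', 'Job2'], ['Job3']]
```

The middle level contains Job1 and Job2, which is exactly the pair that can be placed on different workstations.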

Each job may take several hours/days

Stages of a job

1. Init
2. Stage in
3. Execute
4. Stage out
5. Publish
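
One way to picture the five stages is as an ordered pipeline in which each stage must succeed before the next one starts. This is only a sketch: the `Stage` enum, the `run_job` function, the per-stage handler callables and the stop-on-first-failure behaviour are assumptions made for illustration, not something the notes specify.

```python
from enum import Enum


class Stage(Enum):
    INIT = 1
    STAGE_IN = 2    # transfer input files to the execution site
    EXECUTE = 3
    STAGE_OUT = 4   # transfer output files back from the site
    PUBLISH = 5


def run_job(handlers):
    """Run one handler per stage, in order, stopping at the first failure.

    handlers maps each Stage to a zero-argument callable returning True on success.
    Returns (list of completed stages, failed stage or None).
    """
    completed = []
    for stage in Stage:  # Enum iteration follows definition order
        if not handlers[stage]():
            return completed, stage
        completed.append(stage)
    return completed, None


# Hypothetical job whose stage-out step fails.
handlers = {stage: (lambda: True) for stage in Stage}
handlers[Stage.STAGE_OUT] = lambda: False
completed, failed = run_job(handlers)
print([s.name for s in completed], failed)
```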

These applications are computation-intensive, so they are run in a massively parallel fashion.

The main question is how to allocate and schedule the tasks among the grid resources (distributed workstations).

Scheduling problem

- DAG (Directed Acyclic Graph) of jobs to be scheduled across multiple sites

2-level scheduling infrastructure

1. Intra-site protocol

a. Internal allocation and scheduling (which jobs run on which machines)
b. Monitoring (If a job fails, it needs to be started on another machine on the same site)
c. Distribution and publishing of files

Example: HTCondor protocol (high-throughput computing system from the University of Wisconsin-Madison)
Belongs to a class of cycle-scavenging systems, which
* run on a lot of workstations
* when a workstation is free, have it ask the site's central server (or Globus) for tasks
* if the user hits a key or clicks the mouse, stop the task, either by killing it or by asking the server to reschedule it
* can also run on dedicated machines
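
The cycle-scavenging behaviour listed above can be sketched as a small simulation. This is not HTCondor's actual protocol or API: the `CentralServer` and `Workstation` classes and their method names are hypothetical, and the sketch always picks the "ask the server to reschedule" option rather than killing the task.

```python
from collections import deque


class CentralServer:
    """Hypothetical site server that holds a queue of pending tasks."""

    def __init__(self, tasks):
        self.queue = deque(tasks)

    def request_task(self):
        # Hand out the oldest pending task, or None if there is nothing to do.
        return self.queue.popleft() if self.queue else None

    def reschedule(self, task):
        # A task evicted from a workstation goes back to the end of the queue.
        self.queue.append(task)


class Workstation:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.current = None

    def on_idle(self):
        """The workstation is free: ask the server for a task if none is running."""
        if self.current is None:
            self.current = self.server.request_task()
        return self.current

    def on_user_activity(self):
        """Keystroke or mouse click: stop the task and return it to the server."""
        if self.current is not None:
            self.server.reschedule(self.current)
            self.current = None


server = CentralServer(["task1", "task2"])
ws = Workstation("ws1", server)
first = ws.on_idle()          # picks up task1
ws.on_user_activity()         # owner returns, task1 goes back to the queue
second = ws.on_idle()         # picks up task2; task1 is still waiting
print(first, second, list(server.queue))
```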

2. Inter-site protocol (Globus protocol)

The internal structure of each site may be transparent (invisible) to Globus.
Globus does not do the actual scheduling on the machines.
It handles only external allocation and scheduling, i.e. across sites.
It is also responsible for staging files in and out.
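
The split between the two levels can be sketched as follows. Everything here is an assumption made for illustration: the `Site` class, the `allocate` function and the least-loaded policy are not how Globus or HTCondor actually decide placement. The point is only that the inter-site level picks a site, while the choice of machine stays inside the site.

```python
class Site:
    """Hypothetical site with its own internal (intra-site) scheduler."""

    def __init__(self, name, machines):
        self.name = name
        self.machines = list(machines)
        self.assignments = {}  # job -> machine

    def load(self):
        # Average number of jobs per machine, visible to the inter-site level.
        return len(self.assignments) / len(self.machines)

    def schedule(self, job):
        # Intra-site decision: pick the machine currently running the fewest jobs.
        counts = {m: 0 for m in self.machines}
        for m in self.assignments.values():
            counts[m] += 1
        machine = min(self.machines, key=lambda m: counts[m])
        self.assignments[job] = machine
        return machine


def allocate(jobs, sites):
    """Inter-site decision: send each job to the least-loaded site.

    This level never chooses a machine; it delegates that to the site.
    """
    placement = {}
    for job in jobs:
        site = min(sites, key=lambda s: s.load())
        placement[job] = (site.name, site.schedule(job))
    return placement


sites = [Site("A", ["a1", "a2"]), Site("B", ["b1"])]
placement = allocate(["Job1", "Job2", "Job3"], sites)
print(placement)
```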

* Globus Alliance involves universities, US national research labs and some companies
* Standardized several things, especially software tools
* Separately, but related: Open Grid Forum
* Globus Alliance has developed the Globus Toolkit (contains standard tools to run the inter-site protocol)

Globus Toolkit

* open source
* consists of several components
- GridFTP : Wide-area transfer of bulk data
- GRAM5 (Grid Resource Allocation Manager) : submit, locate, cancel and manage jobs
% not a scheduler
% Globus communicates with the schedulers of intra-site protocols such as HTCondor or the Portable Batch System (PBS)
- RLS (Replica Location Service): Naming service that translates from a file/dir name to a target location (or another file/dir name)
- Libraries like XIO to provide a standard API for all grid I/O functionality
- Grid Security Infrastructure (GSI)
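
The idea behind a replica location service, translating a name into one or more target locations, can be sketched with a simple in-memory catalog. This is not the RLS API: the `ReplicaCatalog` class, its methods, the logical file name and the example URLs are all hypothetical.

```python
class ReplicaCatalog:
    """Hypothetical naming service: logical file name -> known replica locations."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, location):
        # Record one more place where a copy of the file can be found.
        self._replicas.setdefault(logical_name, []).append(location)

    def lookup(self, logical_name):
        # Return a copy so callers cannot modify the catalog by accident.
        return list(self._replicas.get(logical_name, []))


catalog = ReplicaCatalog()
# Made-up logical name and site URLs, for illustration only.
catalog.register("rams/run1/output.dat", "gsiftp://site-a.example.org/data/run1/output.dat")
catalog.register("rams/run1/output.dat", "gsiftp://site-b.example.org/mirror/run1/output.dat")

locations = catalog.lookup("rams/run1/output.dat")
print(locations)
print(catalog.lookup("unknown/file"))  # no replicas registered -> empty list
```

A job that needs the file could then pick whichever location is closest and fetch it with a bulk-transfer tool.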

Security Issues

* Important in Grids because they are federated i.e. no single entity controls the entire infrastructure
* Single Sign-on : collective job set should require once-only user authentication
* Mapping to local security mechanisms: some sites use Kerberos, others use Unix
* Delegation: credentials to access resources are inherited by subcomputations, e.g. from Job0 to Job1
* Community authorization: e.g. third-party authentication

* The above are important in clouds as well, but less so because clouds are typically run under central control
* In clouds the focus is on failures, scale, on-demand access

Summary:

1. Grid computing focuses on computation-intensive computing (HPC)
2. Though grids are often federated, their architecture and key concepts have a lot in common with those of clouds. Grids need to work in symphony and need a lot of coordination, while clouds are optimized for independent work that is coordinated much less frequently. Clouds don't need as tight control over what is going on as grids do.
3. Are Grids/HPC converging towards clouds? E.g. compare OpenStack and Globus
These differ in the standards being developed and in the conferences where papers get published
Look into OpenStack and Globus to see what is similar and what is different

Friday, February 13, 2015