Wednesday, February 4, 2015

Introduction

Many Cloud Providers:

1. AWS
a. EC2 : Elastic Compute Cloud
b. S3  : Simple Storage Service
c. EBS : Elastic Block Store, accessed by EC2 instances

2. Microsoft Azure
3. Google Compute Engine
4. Rightscale, Salesforce, EMC, Gigaspaces, 10gen, Datastax, Oracle, VMware, Yahoo, Cloudera, etc.

Two Categories:

1. Private cloud : accessible only to company employees
2. Public cloud : service to paying customers

Advantages:

* Cloud computing is useful to save money and time to bring up new compute and/or storage instances
- A new server can be up and running in 3 minutes, compared to 7.5 weeks to deploy a server internally.
  A 64-node Linux cluster can be online in 5 minutes, compared to 3 months if deployed internally

- Reduction in IT operational costs by roughly 30%

- A private cloud of virtual servers inside datacenter has saved Sybase $2 million annually because the company can share computing power and storage resources across servers

- Startups can harness large computing resources without buying their own machines

What is a cloud?

Cloud = Lots of storage + compute cycles nearby
Compute is brought closer to data rather than data being moved closer to compute

1. A single-site cloud (aka "Datacenter") consists of
a. Compute nodes (grouped into racks)
b. Switches connecting the racks - many top-of-rack switches connect to one core switch in a 2-level network topology
c. A network topology, e.g. hierarchical
d. Storage (backend) nodes connected to the network
e. Front-end for submitting jobs and receiving client requests
f. Software services

2. A geographically distributed cloud consists of multiple such sites, each perhaps with a different structure and services

History:

1. First data centers (1940-1960) - ENIAC, ILLIAC - each occupied an entire hall
2. Time-sharing companies and the data processing industry - punched cards as input and output (Honeywell, IBM, Xerox)
3. Clusters/Grids (1980-2012) - Personal computers, Cray, Berkeley NOW Project, supercomputers, server farms (e.g. Oceano), Bittorrent, GriPhyN
4. Clouds and datacenters (2000 - present) - similar to the data processing era but with different workloads

Technology Trends 

1. Moore's law : CPU compute capacity doubles every 18 months; earlier this came from CPU frequency, now from the number of cores
2. Storage doubles every 12 months
3. Bandwidth doubles every 9 months
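A quick sketch of what these rough doubling periods imply over the same horizon (the doubling periods are the ones quoted above; the 3-year horizon is just an illustrative choice):

```python
def capacity_after(years, doubling_months):
    # Growth factor after `years`, given a doubling period in months
    return 2 ** (12 * years / doubling_months)

# Growth over 3 years under each trend:
print(capacity_after(3, 18))  # CPU compute:  4x
print(capacity_after(3, 12))  # storage:      8x
print(capacity_after(3, 9))   # bandwidth:   16x
```

The shorter doubling period for bandwidth and storage relative to CPU is one reason clouds favor moving compute to the data rather than the other way around.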

User Trends

Biologists are producing PBs/year of data, which need to be stored and processed


Prophecies:

1. Computer facility operating like a utility (power or water company)
2. Plug your thin client into the computing utility and play your favorite Intensive Compute and Communicate Application
Unix was a precursor to this vision


What's new in today's clouds?

1. Massive scale : Large datacenters
2. On-demand access : Pay-as-you-go, no upfront commitment
3. Data-intensive nature : TBs, PBs, XBs - daily logs, forensics, web data, compressed data
4. New cloud programming paradigms : MapReduce/Hadoop, NoSQL/Cassandra/MongoDB, etc

I. MASSIVE SCALE

Power is either off-site (hydro-electric or coal) or on-site (solar panels)
WUE = Annual Water Usage / IT Equipment Energy
PUE = Total Facility Power / IT Equipment Power
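The two metrics above can be sketched directly from their definitions; the sample figures below are hypothetical, chosen only to illustrate the units:

```python
def pue(total_facility_power_kw, it_equipment_power_kw):
    # Power Usage Effectiveness: 1.0 is ideal (all power reaches IT equipment)
    return total_facility_power_kw / it_equipment_power_kw

def wue(annual_water_usage_liters, it_equipment_energy_kwh):
    # Water Usage Effectiveness, in liters per kWh of IT energy
    return annual_water_usage_liters / it_equipment_energy_kwh

# Hypothetical datacenter: 1200 kW total draw, 1000 kW of it to IT equipment
print(pue(1200.0, 1000.0))        # 1.2
print(wue(180000.0, 1000000.0))   # 0.18 L/kWh
```

The gap between PUE and 1.0 is overhead (cooling, power distribution, lighting), which is why datacenter operators track it.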

II. ON-DEMAND ACCESS : *AAS CLASSIFICATION

1. HaaS : Hardware as a Service - access to barebones hardware machines; but poses a security risk, since clients get direct access to the hardware
2. IaaS : Infrastructure as a Service - avoids the security holes of HaaS. Flexible computing and storage infrastructure; virtualization is one way of achieving this. Eg: AWS
3. PaaS : Platform as a Service - flexible computing and storage infrastructure, coupled with a software platform (not exposed in terms of VMs); easier but less flexible than IaaS. Eg: Google AppEngine/Compute Engine
4. SaaS : Software as a Service - access to software services when you need them. Eg: Google Docs, MS Office on demand

III. DATA-INTENSIVE COMPUTING

1. Computation-Intensive computing - MPI-based, high-performance computing, Grids; typically supercomputers
2. Data-Intensive - store data at datacenters and use compute nodes nearby, since moving enormous amounts of data would unnecessarily consume a lot of bandwidth; compute nodes run computation services. CPU utilization is no longer the most important resource metric; instead I/O (disk and/or network) is.

IV. NEW CLOUD PROGRAMMING PARADIGMS

These new paradigms make it easy to write and run highly parallel programs:
1. Google : MapReduce and Sawzall
2. Amazon : Elastic MapReduce service
3. Yahoo : Hadoop + Pig, WebMap
4. Facebook : Hadoop + Hive
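The core idea behind MapReduce (the paradigm the frameworks above share) can be sketched in a few lines. This is a simplified word-count sketch, not any framework's actual API; a real system would also partition, shuffle, and run the phases on many nodes:

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document
    for word in doc.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + Reduce: group pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the datacenter"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))  # {'the': 2, 'cloud': 1, 'datacenter': 1}
```

Because each map call is independent, the map phase parallelizes trivially across documents; the framework handles grouping by key before the reduce phase.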

Economics of Clouds:

Two categories of clouds: public vs. private.
The key question is: outsource or own?
Do a cost analysis and determine the break-even point for the duration that the cloud/service will be operational
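The break-even analysis can be sketched as a simple comparison of cumulative costs; all dollar figures below are hypothetical, purely to illustrate the calculation:

```python
def break_even_months(capex, own_monthly_opex, cloud_monthly_cost):
    # Owning costs:  capex + own_monthly_opex * t
    # Cloud costs:   cloud_monthly_cost * t
    # Break-even t:  capex / (cloud_monthly_cost - own_monthly_opex)
    if cloud_monthly_cost <= own_monthly_opex:
        return float("inf")  # cloud never costs more; owning never breaks even
    return capex / (cloud_monthly_cost - own_monthly_opex)

# Hypothetical: $60,000 to buy a server, $1,000/month to operate it,
# vs $3,500/month for equivalent cloud capacity.
print(break_even_months(60000, 1000, 3500))  # 24.0 months
```

If the service will run longer than the break-even duration, owning is cheaper; if shorter, outsourcing to the cloud wins.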
