Wednesday, June 29, 2011

Because Everyone Loves HTML5...

Since HTML5 is the hot thing these days, I figured I'd better get some hands-on practice.  But first, let me explain why I think HTML5 is a key technology.  Browser-based technology is becoming more and more pervasive.  We spend a significant portion of our time each day in browsers, whether we are on a desktop, laptop, tablet or phone.  However, there is still a fairly stark contrast between a browser-based application and a true native or 'rich' application.  Browser applications typically have limitations such as only working when you are connected, minimal video and graphics rendering support, and limited data storage options.  Flash and other native additions to browsers have been built to try to compensate for these shortcomings, but they are non-standard and usually closed systems that only cover a subset of browsers or operating systems.  Oh yeah, and Steve Jobs appears to not like Flash...

Okay, you understand - browsers are the limiting factor today.  However, the HTML5 specification changes all of that.  Once browsers have full support for the various HTML5 components, you will be able to manage SQL-like data storage, render complex graphics (SVG, 2D / 3D drawing, and video), operate offline, and even provide context-based information with geolocation.  Browsers are quickly going to become our client application platforms.

So what do we need all this cloud stuff for then?  Well, I agree that some of the traditional workload is going to shift from server to client given the new browser capabilities.  However, I think there is also going to be a drastic shift in application development away from the 'native' approach.  What's on your desktop today is going to be running in your browser tomorrow.  And HTML5 isn't going to be the only driver in this movement.  Native mobile apps are still very popular (given the limited computing power on those devices) and they are all going to depend on that same server infrastructure.  In other words, I think the server-side infrastructure supporting all these new applications is going to grow.  And it's going to grow by a lot.

This is where the cloud, and specifically Platform as a Service (PaaS), comes in.  The cloud provides utility-like resources on demand - a great way to quickly get all those servers you need.  A Platform as a Service builds on that capability and further abstracts you from a lot of the traditional operational management you would otherwise have to do.  While far from a standardized service, a really good Platform as a Service should make you really efficient at both writing code and getting that code to production.

Have I convinced you?  Well then let's get cookin'!  Before you get into all the whiz-bang features of HTML5, it starts with the basics - markup and hosting.  Since it's 2011, I'm going to use a PaaS for my hosting, and even though I'm going to use all this newfangled HTML5 markup, I still need it to work in IE{6,7,8} and look decent.  This leads me to HTML5 Boilerplate and OpenShift.  HTML5 Boilerplate is going to make my next-gen development work in the browsers of today.  OpenShift is going to be my PaaS.  I'll use the OpenShift PHP runtime for this example, but this post is generally applicable to all the OpenShift runtimes (Perl, Ruby, Python, etc.).

Step 1. Register on OpenShift

A free account on OpenShift will provide you with a free runtime environment for this demo.  First, sign up for a new account.  You'll need to validate your email address after registering - just click the link in the email you get.  Once your account is validated and you've gotten the note that you've been approved, you're off to the races - or more accurately, on to Step 2.

Step 2. Create your domain

The domain becomes part of your application's URL.  I wanted http://<app>-nextgen.rhcloud.com so I ran:
rhc-create-domain -n nextgen -l <my-email>

Step 3. Create your application

I'll use PHP for this example and I wanted my URL to be http://html5-nextgen.rhcloud.com so I ran:
rhc-create-app -a html5 -t php-5.3

Step 4. Merge in HTML5 Boilerplate

Now I've got an html5 directory in my current directory.  Time to pull in HTML5 Boilerplate.  In short, this project has all the fanciness to get you off to the races with HTML5, including a decent shot at supporting older browsers as well.  But... I want to pull the latest from HTML5 Boilerplate into the 'php' directory that OpenShift sets up.  Time for some git magic:

# Go into your app directory - mine is named 'html5'
cd html5

# Now get the HTML5 Boilerplate content
git remote add boilerplate git://github.com/paulirish/html5-boilerplate.git
git fetch boilerplate
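# Read the boilerplate tree into the index under php/ and update the working tree to match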
git read-tree --prefix=php/ -u boilerplate/master
git commit -a -m "Merging in HTML5 Boilerplate"

Step 5. Switch over index.php

First, let's switch over to index.php so we can use a PHP function in the example.

cd php
cp index.html index.php
git rm index.html
git commit -a -m "Switching over to use index.php"

Now, open up index.php and change the section after '<div id="main" role="main">' to add '<?php phpinfo(); ?>':

<div id="main" role="main">
<?php phpinfo(); ?>
</div>

Don't forget to commit:
git commit -a -m "Added some fancy php info"

Step 6. Publish

Now, let's see how easy publishing changes can be.

git push

Yep, that's it - open your browser to your URL.  Not convinced it's working?  Highlight something.  Yep, hot pink :)  You are officially an OpenShift / HTML5 Boilerplate user now.
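
If you'd rather check from a terminal, a quick curl against your app URL (substitute whatever app and domain names you chose above) should confirm the page is being served:

# Sanity check from the command line - adjust the URL to your app/domain
curl -I http://html5-nextgen.rhcloud.com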

Experiment some on your own while I write my next blog post.  Next one will be about creating something a little more involved with this base setup.

Referenced Projects

HTML5 Boilerplate (git://github.com/paulirish/html5-boilerplate.git) and OpenShift

Tuesday, June 28, 2011

Cloud, Packaging and State (Part 1 - Embrace RPMs)

Well, let's first discuss which side of the Platform as a Service you are on.  In this post I'm going to talk about the people building and running a Platform as a Service and why technologies like RPMs are still very, very relevant.  But first, a brief detour for the users of a PaaS.  As a developer on OpenShift, I don't want my users to have to worry about RPMs.  Heck, I don't want them to have to worry about WARs, EARs, SARs, Gems or Eggs either.  I want our users to spend as much time as possible working with the thing they created - their source code.  That is where our users put their creativity and hard work, and I want to remove as many barriers as possible on their road from source code to running project.  To our users I say 'You are correct, you will not need RPMs in our cloud!!'  (crowd cheers...)

Now, onto the main audience of this article - those maintaining a Platform as a Service or similar supporting infrastructure in the cloud.  Server-side cloud technologies like Infrastructure as a Service (IaaS) bring a tremendous amount of power, but they also change some design patterns.  Today's infrastructures need to be designed to respond.  You no longer stand up 2 servers and throttle traffic.  You spin up as many servers as you need and respond to your changing needs.  Done well, your computing components appear very fluid.

Let's face it, this is seldom done well, and it's a hard problem.  One of the most common things I see go wrong is that people lose track of the underlying state of the system, and fluidity quickly turns to chaos.  The underlying architectures and processes needed to support a truly fluid infrastructure actually have to be more disciplined than they have ever been in the past.  Knowing the exact state at any point in time is critical to the consistency of a cloud.  It lets you make all the management decisions in a very dynamic manner - whether it's applying routine updates or coordinating a new major release.

In general, I break state into three sections: what developers control, what operations controls, and what the end users control.  Your understanding of each of these areas is critical to being able to manage a highly dynamic system.  In this post, I'm going to focus on the first two - the sometimes contentious development / operations relationship.

Now, as developers we often focus a little too much on our Source Code Management (SCM) system.  We spend our days in git or SVN, we tag or branch releases, and that's where we tend to put our effort.  Sometimes... we even assume that all the important 'state' for a system is in the SCM...  (crowd gasps...)  I know... I've seen it happen.  However, the stark reality is that at some point, code has to leave the SCM and get packaged up to be deployed.  It usually gets deployed in QA and Staging environments before it reaches Production.  Developers - don't lose track of your code when it leaves the nest!  Keep track of that code!

For OpenShift, RPMs are a key part of keeping a handle on that transition.  Wait!  Aren't RPMs old, mysterious, and a little bit evil?  Look, I'll be the first to admit that I've had a long love / hate relationship with RPMs.  They're not the easiest technology to learn and there is a little bit of magic to them.  Great resources that will help demystify them are Maximum RPM and many of the Fedora pages, like their macros page or their Ruby packaging guidelines, but it's cumbersome reading.  However, the real reason to use RPMs is that I just haven't found a better tool for the job.  Yes, there is some pain that goes into the upfront process of getting everything into RPMs.  It makes you really think through your packaging, permissions and system layout - stuff development often ignores.  But I promise, if you put in this work upfront, you'll never go back.  To try to convince you, let's talk about some of the things we actually use them for.

One nice thing about RPM is that it's more than just a packaging specification - each machine also maintains a database of what's installed.  And it's not just a database of package names - it includes all the details of how the software was installed.  Let's go through some real OpenShift use cases.  Real people, real packages, real questions...

Basic Investigation

"What package manages the file /usr/bin/rhc-create-app?"
[root@ip-10-85-70-89 ~]# rpm -qf /usr/bin/rhc-create-app 
rhc-0.72.29-1.el6_1.noarch

We can walk from an installed file to a package.  Pretty nice if you didn't do all the packaging and are doing some investigation.
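
The reverse direction is handy too.  If you want to see every file a given package owns, rpm will list them (using the 'rhc' package from above):

# List every file installed by the 'rhc' package
rpm -ql rhc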


"What's currently installed on hostXYZ?"
[root@ip-10-85-70-89 ~]# rpm -qa | grep rhc
rhc-cartridge-php-5.3-0.73.4-1.el6_1.noarch
rhc-devenv-0.73.3-1.el6_1.noarch
rhc-0.73.5-1.el6_1.noarch
rhc-cartridge-perl-5.10-0.4.5-1.el6_1.noarch
rhc-cartridge-rack-1.1-0.73.4-1.el6_1.noarch
rhc-server-common-0.73.6-1.el6_1.noarch
rhc-common-0.73.2-1.el6_1.noarch
rhc-cartridge-jbossas-7.0-0.73.6-1.el6_1.noarch
rhc-broker-0.73.5-1.el6_1.noarch
rhc-selinux-0.73.2-1.el6_1.noarch
rhc-cartridge-wsgi-3.2-0.73.4-1.el6_1.noarch
rhc-site-0.73.5-1.el6_1.noarch
rhc-node-0.73.5-1.el6_1.noarch

With one command, we can easily see the details about every custom component that we install.  We use the prefix 'rhc-' for our packages to make these queries really easy.

"What other software does the 'rhc' package depend on?"

[root@ip-10-85-70-89 ~]# yum deplist rhc | grep -v provider:
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
Repository jenkins is listed more than once in the configuration
Finding dependencies: 
package: rhc.noarch 0.73.5-1.el6_1
  dependency: ruby >= 1.8.6
  dependency: rubygem-parseconfig
  dependency: /usr/bin/ruby
  dependency: git
  dependency: /usr/bin/env
  dependency: rubygem-json

I filtered the output with grep to only get the dependencies.  By default, it will also show you what provides each of these dependencies.  Now, it's subtle, but you'll notice I'm using yum here instead of rpm.  Yum manages the relationships and metadata between packages - what your package needs installed in order to work, and so on.  Yum will also nicely manage the installation of all those dependencies for you.
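
You can also ask the dependency question in reverse.  If you want to know which installed packages explicitly require something - say, ruby - rpm can answer that as well (note this matches requirements on the package name, not file-based dependencies):

# Show installed packages that explicitly require ruby
rpm -q --whatrequires ruby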

"Ahh, it needs Ruby.  What version of Ruby are we running?"
[root@ip-10-85-70-89 ~]# rpm -q ruby
ruby-1.8.7.299-7.el6_1.1.x86_64

Because unfortunately there's a big difference between Ruby 1.8.6, 1.8.7 and 1.9...

Little More Advanced

"I don't remember if I installed the 'production' or 'development' rhc package..."

[root@ip-10-85-70-89 ~]# rpm -qi rhc | grep Signature
Signature   : RSA/8, Thu 23 Jun 2011 04:05:57 PM EDT, Key ID 938a80caf21541eb

In our process, we only sign packages going to production, so since this RPM has a signature, it came from the real production build system, not a local laptop build.  You can also use different signatures for each environment.  That adds some overhead but is a nice way to tell where packages came from.
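
If you do go the multiple-signature route, it helps to know which public keys a machine actually trusts.  The imported keys show up in the RPM database as gpg-pubkey packages:

# List the GPG public keys imported into the RPM database
rpm -qa gpg-pubkey*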



"Have any files been modified in that package?"
[root@ip-10-85-70-89 ~]# rpm -V rhc
S.5....T.  c /etc/openshift/express.conf

The rpm man page has all the gory details on the format here but this basically says that this one file has changed.  The 'c' denotes it as a config file and the other letters mean the [S]ize has changed, the MD[5] sum is different and the modified [T]ime is different.

Changed config files are pretty normal.  If a binary file had changed, that might be another story...
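
If you want to run that same check across the whole box rather than one package at a time, rpm will verify everything it knows about - just be prepared for a slow run and some noisy output to filter:

# Verify every installed package against the RPM database
rpm -Va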


Put On Your Seatbelt...


"How do I really know where that Signature came from?"
Well, let's figure it out.  First, let's get the signature again.

[root@ip-10-85-70-89 ~]# rpm -qi rhc | grep Signature
Signature   : RSA/8, Thu 23 Jun 2011 04:05:57 PM EDT, Key ID 938a80caf21541eb

Okay, so this thing has a signature Key ID of 938a80caf21541eb.  Let's see what MIT's PGP server says about that key.  Open up http://pgp.mit.edu and enter '0x938a80caf21541eb' in the search box.  Don't forget that '0x' at the beginning of the string.

pub  4096R/F21541EB 2009-02-24 Red Hat, Inc. (beta key 2) <security@redhat.com>
Mark Cox Internal RSA 4096 test key <mjc@redhat.com>
Fingerprint=B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB


Okay, the security@redhat.com email address looks promising.  But how do I really trust that?  Well, let's just verify that fingerprint on Red Hat's site as well.  Go to https://access.redhat.com/security/team/key/ and search on the page for the fingerprint:

B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB

You should see a match at the bottom of the page.  Good news, the originator of this package was Red Hat.
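
And if you still have the .rpm file itself sitting around, rpm can do the checking for you: import the key you just verified (the key file path below is only an example) and then check the signature on the package file directly.

# Import a verified public key into the RPM database (example path)
rpm --import /tmp/RPM-GPG-KEY-redhat-beta
# Check the signature and digests on a package file
rpm -K rhc-0.73.5-1.el6_1.noarch.rpm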

"Is there anything installed on my system that isn't signed or I don't have a public key for?"
This fancy command is courtesy of Mike McGrath.

[root@ip-10-85-70-89 ~]# rpm -q --queryformat '%{NAME} %{SIGPGP:pgpsig}\n' -a | sort | egrep -v "$(rpm -qa gpg-pubkey* | awk -F'-' '{ print $3 }' | tr '\n' '\|' | sed 's/|$//')"
jboss-as7 (none)
jenkins (none)
maven3 (none)
mcollective-client (none)
mcollective-common (none)
...

"Wow.  Can I completely depend on this for security?"
Technically, someone could exploit your system and alter your RPM database to hide any changes.  In those cases, you are probably looking at installing something like Tripwire to help detect even those cleanup efforts.  Security is always a bit of a cat and mouse game, but I'm going to gracefully dodge the deep security questions since the focus of this article is on system state.  Use RPMs for state, but don't assume they give you a free pass to ignore security.

Linking this to Development
Now, the above gives you a great view into your operational state, but how do you tie this back into development?  I can only really describe what we do, since there are an infinite number of ways to approach this.  First, we use git for our SCM, and we use tito to standardize the link between the state of the code and the RPMs.

I won't go into why we use git too much.  It's a distributed revision control system and it's wonderful.  Enough said.  Let's talk about tito though.  The real mechanics that link source code to an RPM for us are git tags.  We have a single git repository with lots of separate components as top-level folders.  For example:

openshift/
         /client
         /site
         /broker
...


Now, there are lots of ways you could approach tagging that would work, and there are lots of ways you can build RPMs.  The key is consistency - you want to mark the code at a point in time for a release.  Then you just need to pick an approach and stick with it.  Tito essentially tags the git repo for each RPM build that you do.  It bumps the version in the RPM spec, puts in the changelog comments, tags the git repository with the full package name (e.g. rhc-0.72.22-1) and submits the build.  We use an internal Koji system to make sure we have reproducible RPM builds with all the dependencies nice and orderly.
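
For a rough idea of what that looks like day to day - the exact flags depend on your tito version and configuration - the flow is basically tag, push, build:

# Bump the version in the spec, add changelog entries, commit, and create the git tag (e.g. rhc-0.72.22-1)
tito tag
# Push the commit and the new tag so the build system can see them
git push && git push --tags
# Build an RPM locally from the latest tag as a sanity check
tito build --rpm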

With that approach, you can walk back to an exact state in the SCM just given a package version.  Whether that package is running in production or any other environment, you know the exact code state that created it and you can also easily make a small patch to it and rebuild it.  Development to operations with full visibility - it's a beautiful thing.
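
In practice, walking back looks something like this - the tag name follows the same full package name convention described above:

# Jump straight to the source that produced a given build
git checkout rhc-0.72.22-1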

Honestly, this is just the tip of the iceberg.  The point is that this diligence in development, packaging and versioning allows you to tell the exact, painful details about any running system.  That understanding will make your releases smoother, your updates easier to manage, and your users happier.  And in the end, hopefully this knowledge will help your fluid architecture stay far away from chaos.

Appendix

I was recently burned again by trying to send package names through the standard sort.  This works fine for 0.1, 0.2 ... 0.9 but when you add 0.10, the normal sort puts it right after 0.1.  Let's start with a simple example:



irb(main):055:0> a = ['0.1', '0.2']
=> ["0.1", "0.2"]
irb(main):056:0> a.sort
=> ["0.1", "0.2"]


Yep, that's what I would expect.  Now, let's see what happens when I add '0.10'.




irb(main):053:0> a = ['0.1', '0.2', '0.10']
=> ["0.1", "0.2", "0.10"]
irb(main):054:0> a.sort
=> ["0.1", "0.10", "0.2"]


Ouch.  Since each entry is treated as a string, the sort has no concept that '10' is greater than '2'.  This is further complicated by the fact that most packages are named with the convention package-major.minor.patch-revision (e.g. mypackage-0.1.23-2).

Since I always end up digging around for a while trying to get the regexes and sorts right, I figured I would write it down this time.

First, let's build ourselves a package list:
pkg_list = 25.times.collect {|num| "mypackage-0.#{num+1}.1-1"}

Now, let's sort it:
pkg_list.sort_by {|pkg| /(.*)-(\d+)\.(\d+)\.(\d+)-?(\d)?/.match(pkg); [$1, $2.to_i, $3.to_i, $4.to_i, $5.to_i]}

This uses the regular expression to match each component of the package name and then returns the values, in the proper forms, in the correct priority order.  The array returned above essentially says 'compare the name first, then the major version (as an integer), then the minor version (as an integer), then the patch version (as an integer), then the revision (as an integer)'.  This makes sort work correctly and will hopefully save you a bug down the road.
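
As an aside, if you find yourself at a shell prompt instead of inside Ruby, GNU coreutils sort has a version-sort mode that handles the same problem (assuming a reasonably recent coreutils):

# Version-aware sort - 0.10 correctly lands after 0.2
printf '%s\n' mypackage-0.1.23-2 mypackage-0.10.1-1 mypackage-0.2.1-1 | sort -V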