Tuesday, June 28, 2011

Cloud, Packaging and State (Part 1 - Embrace RPM's)

Well, let's first discuss which side of the Platform as a Service you are on.  In this post I'm going to talk about the people building and running the Platform as a Service and why technologies like RPM's are still very, very relevant.  First, though, a brief detour for the users of a PaaS.  As a developer on OpenShift I don't want my users to have to worry about RPM's.  Heck, I don't want them to have to worry about WARs, EARs, SARs, Gems or Eggs either.  I want our users to spend as much time as possible working with the thing they created - their source code.  That is where our users put their creativity and hard work, and I want to remove as many barriers as possible on their road from source code to running project.  To our users I say 'You are correct, you will not need RPM's in our cloud!!'  (crowd cheers...)

Now, onto the main focus group of this article - those maintaining a Platform as a Service or similar supporting infrastructure in the cloud.  Service-side cloud technologies like Infrastructure as a Service (IaaS) bring a tremendous amount of power, but they also change some design patterns.  Today's infrastructures need to be designed to respond.  You no longer stand up 2 servers and throttle traffic.  You spin up as many servers as you need and respond to your changing needs.  If well done, your computing components appear very fluid.

Let's face it, this is seldom done well and it's a hard problem.  One of the most common things I see go wrong is that people lose track of the underlying state of the system and fluidity quickly turns to chaos.  The underlying architectures and processes that support a truly fluid infrastructure actually have to be more disciplined than they have ever needed to be in the past.  Knowing the exact state at any point in time is critical to the consistency of a cloud.  It lets you make all the management decisions in a very dynamic manner - whether it's applying routine updates or coordinating a new major release.

In general, I break state into three sections: what developers control, what operations controls, and what the end users control.  Your understanding of each of these areas is critical to being able to manage a highly dynamic system.  In this post, I'm going to focus more on the first two - the sometimes contentious development / operations relationship.

Now, as developers we often focus a little too much on our Source Code Management (SCM) system.  We spend our days in git or SVN, we tag or branch releases, and that's where we tend to put our effort.  Sometimes... we even assume that all the important 'state' for a system is in the SCM...  (crowd gasps...)  I know... I've seen it happen.  However, the stark reality is that at some point, code has to leave the SCM and get packaged up to be deployed.  It usually gets deployed in QA and Staging environments before it reaches Production.  Developers - don't lose track of your code when it leaves the nest!  Keep track of that code!

For OpenShift, RPM's are a key part of keeping a handle on that transition.  Wait!  Aren't RPM's old, mysterious, and a little bit evil?  Look, I'll be the first to admit that I've had a long love / hate relationship with RPM's.  It's not the easiest technology to learn and there is a little bit of magic to them.  Great resources that will help demystify them are Maximum RPM and many of the Fedora pages, like their macros page or their Ruby packaging guidelines, but they are cumbersome reading.  However, the real reason to use RPM's is that I just haven't found a better tool for the job.  Yes, there is some pain that goes into the upfront process of getting everything into RPMs.  It makes you really think through your packaging, permissions and system layout - stuff development often ignores.  However, I promise if you put in this work upfront, you'll never go back.  To try and convince you, let's talk about some of the things we actually use them for.

One nice thing about RPM's is that in addition to the packaging specification, each machine maintains a database of what's installed.  And it's not just a database of package names - it includes all the details of how the software was installed.  Let's go through some real OpenShift use cases.  Real people, real packages, real questions...

Basic Investigation

"What package manages the file /usr/bin/rhc-create-app?"
[root@ip-10-85-70-89 ~]# rpm -qf /usr/bin/rhc-create-app 

We can walk from an installed file to a package.  Pretty nice if you didn't do all the packaging and are doing some investigation.

"What's currently installed on hostXYZ?"
[root@ip-10-85-70-89 ~]# rpm -qa | grep rhc

With one command, we can easily see the details about every custom component that we install.  We use the prefix 'rhc-' for our packages to make these queries really easy.

"What other software does the 'rhc' package depend on?"

[root@ip-10-85-70-89 ~]# yum deplist rhc | grep -v provider:
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
Repository jenkins is listed more than once in the configuration
Finding dependencies: 
package: rhc.noarch 0.73.5-1.el6_1
  dependency: ruby >= 1.8.6
  dependency: rubygem-parseconfig
  dependency: /usr/bin/ruby
  dependency: git
  dependency: /usr/bin/env
  dependency: rubygem-json

I filtered the output with grep to only get the dependencies.  By default, it will show you what provides these dependencies as well.  Now it's subtle, but you'll notice I'm using yum here instead of rpm.  Yum manages the relationships and metadata between packages, such as what your package needs installed in order to work.  Yum will also nicely manage the installation of all those dependencies for you.

"Ahh, it needs Ruby.  What version of Ruby are we running?"
[root@ip-10-85-70-89 ~]# rpm -q ruby

Because unfortunately there's a big difference between Ruby 1.8.6, 1.8.7 and 1.9...

Little More Advanced

"I don't remember if I installed the 'production' or 'development' rhc package..."

[root@ip-10-85-70-89 ~]# rpm -qi rhc | grep Signature
Signature   : RSA/8, Thu 23 Jun 2011 04:05:57 PM EDT, Key ID 938a80caf21541eb

In our process, we only sign stuff going to production so since this RPM has a signature, it was from the real production build system, not a local laptop build.  You can also use different signatures for each environment.  That adds some overhead but is a nice way to tell where packages came from.

"Have any files been modified in that package?"
[root@ip-10-85-70-89 ~]# rpm -V rhc
S.5....T.  c /etc/openshift/express.conf

The rpm man page has all the gory details on the format here but this basically says that this one file has changed.  The 'c' denotes it as a config file and the other letters mean the [S]ize has changed, the MD[5] sum is different and the modified [T]ime is different.

Changed config files are pretty normal.  If a binary file had changed, that might be another story...
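
If you end up reading a lot of this output, the flag string can be decoded mechanically.  Here is a rough Ruby sketch of that idea.  The helper name parse_verify_line and the two lookup tables are my own illustration (not part of rpm or OpenShift); the letter meanings follow the rpm man page.  It only handles flag lines like the one above, and it assumes the path contains no spaces.

```ruby
# Meaning of each letter that can appear in the 9-character flag string
# printed by `rpm -V` (a '.' in a position means that test passed).
VERIFY_FLAGS = {
  'S' => 'size', 'M' => 'mode', '5' => 'digest', 'D' => 'device',
  'L' => 'symlink target', 'U' => 'user', 'G' => 'group',
  'T' => 'mtime', 'P' => 'capabilities'
}.freeze

# Optional one-letter attribute marker between the flags and the path.
FILE_TYPES = {
  'c' => 'config', 'd' => 'documentation', 'g' => 'ghost',
  'l' => 'license', 'r' => 'readme'
}.freeze

# Hypothetical helper: split one flag line into its parts.
# Does not handle "missing <file>" lines or paths containing spaces.
def parse_verify_line(line)
  flags, *rest = line.split
  path = rest.pop
  marker = rest.first # nil when the file has no attribute marker
  {
    path: path,
    type: FILE_TYPES.fetch(marker, 'regular'),
    changed: flags.chars.map { |ch| VERIFY_FLAGS[ch] }.compact
  }
end

result = parse_verify_line('S.5....T.  c /etc/openshift/express.conf')
puts "#{result[:path]} (#{result[:type]}): #{result[:changed].join(', ')}"
```

With something like this you could flag any changed file whose type is not 'config' for a closer look.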

Put On Your Seatbelt...

"How do I really know where that Signature came from?"
Well, let's figure it out.  First, let's get the signature again.

[root@ip-10-85-70-89 ~]# rpm -qi rhc | grep Signature
Signature   : RSA/8, Thu 23 Jun 2011 04:05:57 PM EDT, Key ID 938a80caf21541eb

Okay, so this thing has a signature Key ID of 938a80caf21541eb.  Let's see what MIT's PGP server says about that key.  Open up http://pgp.mit.edu and enter '0x938a80caf21541eb' in the search box.  Don't forget that '0x' at the beginning of the string.

pub  4096R/F21541EB 2009-02-24 Red Hat, Inc. (beta key 2) <security@redhat.com>
Mark Cox Internal RSA 4096 test key <mjc@redhat.com>
Fingerprint=B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB

Okay, the security@redhat.com email address looks promising.  But how do I really trust that?  Well, let's just verify that fingerprint on Red Hat's site as well.  Go to https://access.redhat.com/security/team/key/ and search on the page for the fingerprint:

B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB

You should see a match at the bottom of the page.  Good news, the originator of this package was Red Hat.
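
Notice that the Key ID is simply the last 16 hex digits of the fingerprint, so the comparison can be scripted.  Here is a small Ruby sketch using the two strings from above.  The helper names key_id_from and fingerprint_matches? are made up for this example, and the check only tells you the two strings are consistent; you still need to get the fingerprint from a source you trust.

```ruby
# Both strings are copied from the output shown earlier in this post.
signature_line = 'Signature   : RSA/8, Thu 23 Jun 2011 04:05:57 PM EDT, Key ID 938a80caf21541eb'
fingerprint    = 'B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB'

# Hypothetical helper: pull the 16-hex-digit key ID out of the Signature line.
def key_id_from(signature_line)
  match = /Key ID ([0-9a-f]{16})/i.match(signature_line)
  match && match[1].downcase
end

# Hypothetical helper: the long key ID should be the tail of the fingerprint.
def fingerprint_matches?(fingerprint, key_id)
  fingerprint.delete(' ').downcase.end_with?(key_id)
end

key_id = key_id_from(signature_line)
puts "Key server search string: 0x#{key_id}"
puts "Fingerprint ends with key ID: #{fingerprint_matches?(fingerprint, key_id)}"
```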

"Is there anything installed on my system that isn't signed or I don't have a public key for?"
This fancy command is courtesy of Mike McGrath.

[root@ip-10-85-70-89 ~]# rpm -q --queryformat '%{NAME} %{SIGPGP:pgpsig}\n' -a | sort | egrep -v "$(rpm -qa gpg-pubkey* | awk -F'-' '{ print $3 }' | tr '\n' '\|' | sed 's/|$//')"
jboss-as7 (none)
jenkins (none)
maven3 (none)
mcollective-client (none)
mcollective-common (none)

"Wow.  Can I completely depend on this for security?"
Technically someone could really exploit your system and alter your RPM DB to hide any changes.  In those cases, you are probably looking at installing something like Tripwire to help even detect those cleanup efforts. Security is always sort of a cat and mouse game but I'm going to try and gracefully dodge the deep security questions since the focus of this article is on system state.  Use RPM's for state but don't assume they give you a free pass to ignore security.

Linking this to Development
Now, the above gives you a great view into your operational state, but how do you tie this back into development?  I can only really describe what we do since there are an infinite number of ways to approach this.  First, we use git for our SCM and we use tito to standardize our link between the state of the code and the RPMs.

I won't go into why we use git too much.  It's a distributed revision control system and it's wonderful.  Enough said.  Let's talk about tito though.  The real mechanics that link source code to an RPM for us are git tags.  We have a single git repository with lots of separate components as top-level folders.  For example:


Now there are lots of ways that you could approach tagging that would work and there are lots of ways you can build RPM's.  The key is consistency - you want to mark the code at a point in time for a release.  Then you just need to pick an approach and stick with it.  Tito essentially tags the git repo for each RPM build that you do.  Tito increments the RPM spec, puts in the comments, tags the git repository with the full package name (e.g. rhc-0.72.22-1) and submits the build.  We use an internal Koji system to make sure we have reproducible RPM builds with all the dependencies nice and orderly.

With that approach, you can walk back to an exact state in the SCM just given a package version.  Whether that package is running in production or any other environment, you know the exact code state that created it and you can also easily make a small patch to it and rebuild it.  Development to operations with full visibility - it's a beautiful thing.
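
To make that walk-back concrete, here is a tiny Ruby sketch that turns an installed package string into the git tag to check out.  Treat it as an assumption-laden illustration: git_tag_for is a made-up helper, and it assumes your tags look like rhc-0.72.22-1 while installed packages may carry a dist suffix such as .el6_1.  If your tags include the suffix, skip the stripping.

```ruby
# Hypothetical helper: map an installed name-version-release string to a git tag.
# Assumes the tag omits any trailing dist suffix such as ".el6_1".
def git_tag_for(nvr)
  nvr.sub(/\.el\d+.*\z/, '')
end

tag = git_tag_for('rhc-0.73.5-1.el6_1')
puts "git checkout #{tag}"
```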

Honestly, this is just the tip of the iceberg.  The point is that this diligence in development, packaging and versioning allows you to tell the exact, painful details about any running system.  That understanding will make your releases smoother, updates easier to manage, and your users happier.  And in the end, hopefully this knowledge will help your fluid architecture stay far away from chaotic.


I was recently burned again by trying to send package names through the standard sort.  This works fine for 0.1, 0.2 ... 0.9 but when you add 0.10, the normal sort puts it right after 0.1.  Let's start with a simple example:

irb(main):055:0> a = ['0.1', '0.2']
=> ["0.1", "0.2"]
irb(main):056:0> a.sort
=> ["0.1", "0.2"]

Yep, that's what I would expect.  Now, let's see what happens when I add '0.10'.

irb(main):053:0> a = ['0.1', '0.2', '0.10']
=> ["0.1", "0.2", "0.10"]
irb(main):054:0> a.sort
=> ["0.1", "0.10", "0.2"]

Ouch.  Since each entry is treated as a string, sort has no concept that '10' is greater than '2'.  This is further complicated by the fact that most packages are named with the convention package-major.minor.patch-revision (e.g. mypackage-0.1.23-2).

Since I always end up digging for a while to try and get the regex's and sorts right, I figured I would write it down this time.

First, let's build ourselves a package list:
pkg_list = 25.times.collect {|num| "mypackage-0.#{num+1}.1-1"}

Now, let's sort it:
pkg_list.sort_by {|pkg| /(.*)-(\d+)\.(\d+)\.(\d+)(?:-(\d+))?/.match(pkg); [$1, $2.to_i, $3.to_i, $4.to_i, $5.to_i]}

This uses the regular expression to match each component of the package name.  Next, it returns the values in the proper forms in the correct priority order.  For example, the array returned above essentially says 'compare name first, then the major version (as an integer), then the minor version (as an integer), then the patch version (as an integer), then the revision (as an integer)'.  This will make sort work and hopefully save you a bug down the road.
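
If your versions don't always have exactly three components, the same idea can be generalized by comparing arrays of integers.  Here is a sketch of that variation.  The helper name package_sort_key is my own, and it assumes a purely numeric, dot-separated version followed by a numeric revision.  Real RPM version comparison has more rules than this (letters, dist tags and so on), so treat it as a starting point only.

```ruby
# Hypothetical helper: turn "name-version-revision" into a sortable key.
# Assumes a dot-separated numeric version and a numeric revision.
def package_sort_key(pkg)
  match = /\A(.+)-(\d+(?:\.\d+)*)-(\d+)\z/.match(pkg)
  raise ArgumentError, "unrecognized package name: #{pkg}" unless match

  name, version, revision = match.captures
  # Arrays compare element by element, so [0, 10] sorts after [0, 2].
  [name, version.split('.').map(&:to_i), revision.to_i]
end

packages = ['mypackage-0.10.1-1', 'mypackage-0.2-10', 'mypackage-0.2-2', 'mypackage-0.2.1-1']
sorted = packages.sort_by { |pkg| package_sort_key(pkg) }
puts sorted
```

Raising on names that don't match is a deliberate choice here: a silent mis-sort is exactly the kind of bug this is trying to avoid.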
