EC2 Security Groups for the Web-UI Challenged

Anybody who has used the AWS Console has run across security groups when launching instances, but until this weekend (when I wrote another simple script to interrogate them via Python Boto) I really had only scratched the surface.

So I knew that (unlike traditional firewall rules) all you have with security groups is ACCEPTs. No blocks.

I also knew that these rules apply to all traffic destined for your instances, not just Internet traffic, and that Rackspace Cloud Servers has nothing like this, so you are left to manage the individual iptables rules on all your servers.

Here are some things that I should have known, but only discovered after digging deeper:

  • You can use security groups as a source to allow, not just CIDR blocks
  • You can use your customer ID as a source to allow. Not sure I’d want to use that.
  • The SecurityGroup API allows you to identify the instances that belong to the security group

For this blog I created three different security groups:

  1. bastion – a host that I SSH into to reach my other instances
  2. web – obviously the web servers that serve up 80/443
  3. database – MySQL servers that accept connections, hopefully from non-PHP web application servers.

You could create additional groups for your monitoring servers (whether agent- or SNMP-based) to allow only communications on those ports. Pretty simple, eh?
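To make the accept-only model concrete, here is a minimal sketch in plain Python 3 (no Boto, no AWS calls). The Rule tuple and the is_allowed function are hypothetical names made up for illustration; the rules mirror the database group from this post. It only shows the idea that a source can be either a CIDR block or another group, and that anything not explicitly accepted is dropped.

```python
import ipaddress
from collections import namedtuple

# Hypothetical, simplified rule: one protocol/port and one source, which is
# either a CIDR block or the name of another security group.
Rule = namedtuple("Rule", "protocol port source")

def is_allowed(rules, protocol, port, src_ip, src_groups=()):
    """Return True if any rule accepts the traffic; there are no deny rules."""
    for rule in rules:
        if rule.protocol != protocol or rule.port != port:
            continue
        if "/" in rule.source:
            # CIDR source: match on the sender's address
            if ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source):
                return True
        elif rule.source in src_groups:
            # Group source: match on the sender's group membership
            return True
    return False  # nothing matched, so the traffic is not accepted

database_rules = [
    Rule("tcp", 22, "bastion"),
    Rule("tcp", 3306, "web"),
]

# A web server may reach MySQL; the bastion may not.
print(is_allowed(database_rules, "tcp", 3306, "10.209.122.118", ["web"]))
print(is_allowed(database_rules, "tcp", 3306, "10.194.250.137", ["bastion"]))
```

The real service evaluates this for you, of course; the sketch is only a way to reason about what a given set of rules will and will not let through.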

So when I run my script (with no arguments) I can identify all my running hosts, as well as the rules that are included in the group.

mfranz@mfranz-x60s:~/Documents/Coding/awstools/ec2$ ./ec2fw.py
Connecting to EC2...

 --- default/sg-c28f6aab ---
[Instance Members]
[Rules]
	 SRC: [0.0.0.0/0] DST: tcp/22
	 SRC: [0.0.0.0/0] DST: tcp/80
	 SRC: [0.0.0.0/0] DST: icmp/-1 

 --- web/sg-04a5696c ---
[Instance Members]
	ec2-23-20-91-155.compute-1.amazonaws.com (10.209.122.118)
[Rules]
	 SRC: [bastion-ZZZZZZZZZ] DST: tcp/22
	 SRC: [0.0.0.0/0] DST: tcp/80
	 SRC: [0.0.0.0/0] DST: tcp/443
	 SRC: [0.0.0.0/0] DST: icmp/-1 

 --- bastion/sg-dea468b6 ---
[Instance Members]
	ec2-23-20-83-159.compute-1.amazonaws.com (10.194.250.137)
[Rules]
	 SRC: [0.0.0.0/0] DST: tcp/22
	 SRC: [0.0.0.0/0] DST: icmp/-1 

 --- database/sg-26cd014e ---
[Instance Members]
	ec2-107-22-124-9.compute-1.amazonaws.com (10.208.138.83)
[Rules]
	 SRC: [bastion-ZZZZZZZZZZZ] DST: tcp/22
	 SRC: [web-ZZZZZZZZZ] DST: tcp/3306

I obviously obfuscated my customer id. And since the only thing I like better than Python is generating .dot files with python, I added a plot option to output.

mfranz@mfranz-x60s:~/Documents/Coding/awstools/ec2$ ./ec2fw.py plot
Connecting to EC2...
"0.0.0.0/0"->"web"[label="tcp/80 tcp/443 icmp/-1"];
"bastion"->"database"[label="tcp/22"];
"0.0.0.0/0"->"default"[label="tcp/22 tcp/80 icmp/-1"];
"web"->"database"[label="tcp/3306"];
"bastion"->"web"[label="tcp/22"];
"0.0.0.0/0"->"bastion"[label="tcp/22 icmp/-1"];

Which when rendered, produces this output.

So I was shocked to find very few examples of using Boto to manage security groups. Because it is so easy?

I doubt it.

So here is an excerpt.

import boto

# Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
e = boto.connect_ec2()
for s in e.get_all_security_groups():
    print "\n --- %s/%s ---" % (s.name, s.id)
    print "[Instance Members]"
    for i in s.instances():
        print "\t%s (%s)" % (i.public_dns_name, i.private_ip_address)
    print "[Rules]"
    for r in s.rules:
        # r.grants is the list of sources allowed for this protocol/port
        print "\t SRC: %s DST: %s/%s " % (r.grants, r.ip_protocol, r.to_port)

Grants contains the list of sources that are allowed. Because the __repr__ of each grant includes either the CIDR block or the security group name followed by the owner ID, I strip out my customer ID:

            for g in r.grants:
                src = repr(g)
                # Group grants look like "name-ownerid"; CIDR grants have no dash.
                # Note: this also truncates group names that contain a dash.
                if settings['sanitize'] and '-' in src:
                    src = src.split('-')[0]

Then I create a tuple for the source and destination security groups and then add the rules.

                p = (src, s.name)
                if p[0]:
                    if p not in pairs:
                        pairs.append(p)
                        flows[p] = []
                    flows[p].append(r.ip_protocol + "/" + r.to_port)

This allows me to then generate the .dot output quite easily.


    for f in flows.keys():
        rules =  " ".join(flows[f])
        label =  '[label="' + rules + '"];'
        print quotify(f[0]) + "->" + quotify(f[1]) + label
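The excerpts above leave out the quotify helper and the surrounding setup, so here is a self-contained sketch of the same grouping idea in Python 3. The quotify shown here is a guess at what such a helper does (wrap a name in double quotes), and to_dot and the sample tuples are made up for illustration; this is not the code from my script.

```python
from collections import OrderedDict

def quotify(name):
    # Hypothetical stand-in for the helper used above: wrap a node name in quotes
    return '"' + name + '"'

def to_dot(rules):
    """rules: iterable of (source, destination_group, protocol, port) tuples."""
    flows = OrderedDict()
    for src, dst, proto, port in rules:
        # One edge per (source, destination) pair, collecting every protocol/port
        flows.setdefault((src, dst), []).append("%s/%s" % (proto, port))
    lines = []
    for (src, dst), ports in flows.items():
        label = " ".join(ports)
        lines.append('%s->%s[label="%s"];' % (quotify(src), quotify(dst), label))
    return lines

sample = [
    ("0.0.0.0/0", "web", "tcp", "80"),
    ("0.0.0.0/0", "web", "tcp", "443"),
    ("web", "database", "tcp", "3306"),
]
for line in to_dot(sample):
    print(line)
```

Keying the dictionary on the (source, destination) tuple removes the need for a separate list of pairs. To render the result with Graphviz, the edge lines need to be wrapped in a digraph { ... } block.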

Bootstrapping LXC on EC2 with Boto, Paramiko, and Chef

Why?
Given how much I love Python and container-based virtualization, I thought it would be a fun exercise to automatically provision an EC2 instance and start building LXC containers.

I started by modifying an old script I blogged about a while back. My new script is available here. There are still some bugs but it works most of the time and it definitely gets the changes across.

One of the changes I made was to pick an Ubuntu Natty AMI instead of the cheap Amazon Linux I usually use. I used the Ubuntu Cloud Portal to find the right AMI to pass to Boto for an East Coast t1.micro instance running 11.04, which I find to be the most flexible platform for running LXC.

AMI = "ami-e358958a"
AMI_TYPE = "t1.micro"

Another change was to install the necessary packages over SSH once the instance was up. Since I knew that Paramiko was used by Fabric (which I already had installed), I thought I’d give that a try.

try:
    import paramiko as ssh
except ImportError:
    # Fall back to the standalone ssh module if Paramiko is not installed
    import ssh

Next I want to send commands to install git, which I need to install my LXC Chef Recipe which I’ll get to later.

sclient = ssh.SSHClient()
sclient.set_missing_host_key_policy(ssh.AutoAddPolicy())
sclient.connect(i.public_dns_name, username='ubuntu', key_filename=os.environ['AWS_SSH_PEM'])
(stdin, stdout, stderr) = sclient.exec_command('sudo apt-get -y update; sudo apt-get -y install git-core')

Unlike some of the blog posts suggested, I had no problem running sudo from within an exec_command.

On to Chef
Given that I hate Puppet and that it is painful to send multiple commands more complex than what I did above, I thought it would be a good idea to use chef-solo to install the necessary packages and create the cgroup mount point. The problem I ran into is that the Chef package on Ubuntu 11.04 prompts the user, so you have to set the DEBIAN_FRONTEND=noninteractive variable for any automated package installs. So I created a quick shell script that I would sftp up to the instance.

chef_bootscript="""#!/usr/bin/env bash
export DEBIAN_FRONTEND=noninteractive
apt-get -y update
apt-get -y install chef
mkdir -p /var/chef/cookbooks
git clone git://github.com/mdfranz/cookbooks.git
cp -av cookbooks/lxc /var/chef/cookbooks/
"""

I wanted a single file to do this (vs. having to maintain a separate .sh) so I created a tempfile that I then uploaded and executed.

import tempfile

# Write the boot script to a temp file; flush so the bytes are on disk before upload
boot_file = tempfile.NamedTemporaryFile()
boot_file.write(chef_bootscript)
boot_file.flush()

# Upload it over the SSH connection opened earlier
sftpclient = sclient.open_sftp()
sftpclient.put(boot_file.name,"/tmp/chefboot.sh")

# Install Chef
(stdin, stdout, stderr) = sclient.exec_command('sudo sh /tmp/chefboot.sh')
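The flush() call in the tempfile step matters more than it looks: the upload reads the file by name, so anything still sitting in Python's write buffer would be missing. The sketch below shows this with the standard library only, in Python 3 on a Unix-like system (reopening a NamedTemporaryFile by name while it is open does not work on Windows). The script contents and variable names are placeholders.

```python
import os
import tempfile

script = "#!/usr/bin/env bash\necho hello\n"  # placeholder boot script

# mode="w" so we can write text; the file is removed when the block exits
with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as boot_file:
    boot_file.write(script)
    size_before = os.path.getsize(boot_file.name)  # may not include buffered data
    boot_file.flush()
    size_after = os.path.getsize(boot_file.name)   # now matches what was written
    with open(boot_file.name) as reader:
        on_disk = reader.read()                    # what an uploader would see

print("before flush: %d bytes, after flush: %d bytes" % (size_before, size_after))
```

An uploader that opens the path (as an SFTP put does) only ever sees what has reached the file, which is why the flush comes before the upload.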

This obviously installs Chef, creates the cookbooks directory, and then pulls down my cookbooks from GitHub and copies the LXC one into place.

My First Humble Chef Recipe

My LXC recipe does only three things: it installs the dependent packages, creates the cgroup mount point, and then mounts it if it is not already mounted.

package("lxc")
package("debootstrap")
package("libvirt-bin")

directory "/cgroup" do
    owner "root"
    group "root"
    action :create
end

execute "mount" do
    command "mount none -t cgroup /cgroup"
    user "root"
    not_if "mount | grep cgroup"
end

Simple eh? I tried unsuccessfully to use a mount resource, but it did not work. That part took the longest.

So what does it look like?
So I’ve successfully run this on OSX Lion and Ubuntu Oneiric (both with Python 2.7) but it should work on others.


mfranz@mfranz-x60s:~/Documents/Coding/awstools$ set | grep AWS | cut -d"=" -f1
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SSH_PEM

The first two environment variables are obvious; the last one points at the private key file for the SSH key pair that you set up in the AWS Console.

mfranz@mfranz-x60s:~/Documents/Coding/awstools/ec2$ ./ec2lxc.py start
Connecting to EC2...
Creating instance in: us-east-1
Launch time: 2012-02-11T01:26:13.000Z
State: pending
State: pending
State: pending
State: pending
State: pending
State: running
Host name assigned: ec2-23-20-72-27.compute-1.amazonaws.com
Attempting to connect to: ec2-23-20-72-27.compute-1.amazonaws.com
Cannot connect to ec2-23-20-72-27.compute-1.amazonaws.com waiting for 4 seconds
Attempting to connect to: ec2-23-20-72-27.compute-1.amazonaws.com
Cannot connect to ec2-23-20-72-27.compute-1.amazonaws.com waiting for 6 seconds
Attempting to connect to: ec2-23-20-72-27.compute-1.amazonaws.com
Connected to: ec2-23-20-72-27.compute-1.amazonaws.com
Creating:
/tmp/tmpMJE4eN
Creating:
/tmp/tmpTFChEU
['Ign http://security.ubuntu.com natty-security InRelease\n', 'Get:1 http://security.ubuntu.com natty-security Release.gpg [198 B]\n', 'Get:2 http://security.ubuntu.com natty-security Release [39.8 kB]\n', 'Ign http://us-east-1.ec2.archive.ubuntu.com natty InRelease\n', 'Ign http://us-east-1.ec2.archive.ubuntu.com natty-updates InRelease\n', 'Hit http://us-east-1.ec2.archive.ubuntu.com natty Release.gpg\n', '
[...]
INFO: Starting Chef Solo Run\n', '[Sat, 11 Feb 2012 01:27:48 +0000] INFO: Replacing the run_list with "lxc" from JSON\n', '[Sat, 11 Feb 2012 01:27:48 +0000] INFO: Installing package[lxc] version 0.7.4-0ubuntu7.2\n', '[Sat, 11 Feb 2012 01:27:52 +0000] INFO: Installing package[debootstrap] version 1.0.29ubuntu1\n', '[Sat, 11 Feb 2012 01:27:54 +0000] INFO: Installing package[libvirt-bin] version 0.8.8-1ubuntu6.7\n', '[Sat, 11 Feb 2012 01:28:02 +0000] INFO: Creating directory[/cgroup] at /cgroup\n', '[Sat, 11 Feb 2012 01:28:02 +0000] INFO: Setting owner to 0 for directory[/cgroup]\n', '[Sat, 11 Feb 2012 01:28:02 +0000] INFO: Setting group to 0 for directory[/cgroup]\n', '[Sat, 11 Feb 2012 01:28:02 +0000] INFO: Ran execute[mount] successfully\n', '[Sat, 11 Feb 2012 01:28:02 +0000] INFO: Chef Run complete in 15.94268 seconds\n']
mfranz@mfranz-x60s:~/Documents/Coding/awstools/ec2$ ./ec2lxc.py stop
Connecting to EC2...
Host: ec2-23-20-13-118.compute-1.amazonaws.com running
State: stopping
State: shutting-down
State: shutting-down
State: shutting-down
State: shutting-down
Host: ec2-23-20-72-27.compute-1.amazonaws.com running
State: stopping
State: shutting-down
State: shutting-down
State: shutting-down

As you can see it shut down both VMs, the one I just created now and the other one I left running from last weekend. I hate it when that happens.
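The "Cannot connect ... waiting for N seconds" lines in the start output come from retrying the SSH connection while the instance boots. A rough sketch of that pattern in Python 3 follows; connect_with_retry, its parameters, and the fake connect function are all invented for illustration, and the 4 and 6 second waits are simply taken from the output above rather than from the script itself.

```python
import time

def connect_with_retry(connect, host, attempts=5, first_wait=4, step=2,
                       errors=(OSError,), sleep=time.sleep):
    """Call connect(host) until it succeeds, waiting a little longer each time."""
    wait = first_wait
    for attempt in range(1, attempts + 1):
        print("Attempting to connect to: %s" % host)
        try:
            return connect(host)
        except errors:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            print("Cannot connect to %s waiting for %d seconds" % (host, wait))
            sleep(wait)
            wait += step

# Fake connection that fails twice, so the sketch runs without a network
calls = []
def flaky_connect(host):
    calls.append(host)
    if len(calls) < 3:
        raise OSError("connection refused")
    return "connected"

waits = []
result = connect_with_retry(flaky_connect, "example.invalid", sleep=waits.append)
print(result, waits)
```

With a real SSH client, connect would wrap the client's connect call, and errors would need to include whatever exceptions that library raises while sshd is still starting.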

Auditing your S3 Buckets with Boto & PyCurl

So I just committed a quick Python script that I had been trying to write in Node.js over the weekend, but I’d been struggling with the half-dozen or so AWS APIs for Node, none of which were as fully developed or well-documented as Python Boto.

The intent is somewhat similar to Bucket Finder, except that instead of trying to crawl other folks’ public buckets, my script allows you to specify your API keys (via environment variables) so that you can pull down a complete list of all your buckets. It then attempts to fetch each object using PyCurl without authentication and records the HTTP response code, which should be 403 for all non-public buckets. The only real tricky thing I ran into was that the bucket names are unicode and Curl only accepts normal strings.

#!/usr/bin/env python

import sys,os,pycurl
from boto.s3.connection import S3Connection

options = { "transport" : "http" }
aws_keys = []
urls = []  # all the URLs we've gathered from your buckets
preauth_response = {}  # URL -> HTTP status code returned without authentication

def usage():
    print "Usage\n\t./audbuck crawl"

def discard(data):
    # Ignore response bodies; only the status code is of interest
    return len(data)

if __name__ == "__main__":

    if len(sys.argv) == 1:
        usage()
    elif sys.argv[1] == "crawl":
        if len(sys.argv) > 2:
            # Read in keys from a .csv
            pass
        else:
            aws_keys.append( (None,os.environ['AWS_ACCESS_KEY_ID'],
                            os.environ['AWS_SECRET_ACCESS_KEY']) )
        for k in aws_keys:
            c = S3Connection(k[1],k[2])
            server = c.host

            for b in c.get_all_buckets():
                print b.name
                for f in b.get_all_keys():
                    print "- %s" % f.name
                    url = "%s://%s/%s/%s" % (options['transport'],server,b.name,f.name)
                    urls.append(url.rstrip())

        for u in urls:
            pc = pycurl.Curl()
            pc.setopt(pycurl.URL,str(u))  # names are unicode; Curl wants a plain str
            pc.setopt(pycurl.WRITEFUNCTION,discard)
            pc.perform()
            code = pc.getinfo(pycurl.HTTP_CODE)
            pc.close()
            preauth_response[u] = code
    else:
        usage()
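The script stops once the preauth_response dictionary is filled in. One possible next step is a small report that sorts the findings, sketched below in Python 3. The classify and report functions, the verdict labels, and the sample URLs are all made up for illustration; the only rule taken from the post is that 403 is the expected answer for non-public buckets.

```python
def classify(code):
    """Map an unauthenticated HTTP status code to a rough verdict."""
    if code == 200:
        return "PUBLIC"    # anyone can read this object
    if code == 403:
        return "private"   # the expected answer for non-public buckets
    return "check"         # redirects, 404s and so on need a human look

def report(responses):
    """Return one line per URL, most worrying findings first."""
    order = {"PUBLIC": 0, "check": 1, "private": 2}
    rows = [(classify(code), code, url) for url, code in responses.items()]
    rows.sort(key=lambda row: (order[row[0]], row[2]))
    return ["%-7s %d %s" % row for row in rows]

# Made-up results in the same shape as preauth_response
sample = {
    "http://s3.amazonaws.com/example-private/backup.tgz": 403,
    "http://s3.amazonaws.com/example-public/index.html": 200,
    "http://s3.amazonaws.com/example-moved/readme.txt": 301,
}
for line in report(sample):
    print(line)
```

Treating everything other than 200 and 403 as "check" is a deliberately cautious choice for an audit: it is better to look at a redirect by hand than to assume it is harmless.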

Why You Must Conduct Hands-On Technical Screens

A Decade+ of AFK Interview Experience

I’m not ashamed to say that I’ve interviewed at a lot of different shops over the years: Microsoft, Red Hat, Amazon, Google, and a whole lot more. None of these companies ever had me type on a keyboard as part of the interview process–although truth be told I stopped the process after the initial screen for Google, so perhaps they would have.

In fact, very few places I’ve interviewed ever had me do anything on the keyboard or remotely close to what I thought I might be doing there. One of the few  exceptions was my first tech job at Trident in San Antonio. As part of the interview for a training position I gave a sample lesson on ipfwadm (yes it was that long ago) but I digress.

On the other side of the table, I’ve interviewed literally hundreds of candidates over the last decade. At Cisco, I did have folks occasionally do some keyboard time during the onsite. We also had a pretty solid paper exam, too. I’d give them a strange .pcap and see what they could do with it as a way of assessing their network and protocol skills. I did the same thing when I interviewed folks at SAIC, depending on the candidate. Sometimes I also would stick them on a router or firewall to see how fluent they really were or see how fast they could learn Python at the REPL, if they said they could code–or if they were a completely green candidate I was judging for “potential.”

The conclusion I’ve come to as a hiring manager is that you MUST see how somebody will react in an interview situation at the command line for any sort of technical position. Paper tests and traditional Q&A (no matter how good the questions are) just don’t cut it. There is no excuse.

Assessing Linux (or Security) Admins Onsite

Prior to my current role I’d only managed security consultants/engineers so interviewing operations folks was new to me. We had no “sysadmin test,” so I came up with one. I started by creating exercises for them to do, but I’ve settled on a completely extemporaneous format based on simple tasks. Or tasks that appear simple, because there are some peculiar things about the machine (that I won’t give away) that force them to do some troubleshooting and ask some questions, which I’m usually quite happy to answer.

Of course, I explain to candidates that they have man pages and Google (or whatever) to find the answers, or they can ask me. I may or may not know the answer to the question. That is the real world. That is also a test because I’m shocked when folks don’t even know how to properly research error messages or even use Google.

The first hour of the onsite interview the candidate spends on a VM doing various things I come up with on the fly. I used to be a technical trainer so this is easy and fun and gives me a taste of the classroom. Perhaps something like installing a service and configuring it. Fire up tcpdump and tell me what you see and why. Reconstruct our network based on packet captures, etc. Sometimes I break things on purpose. For security candidates, I may give them credentials to our Nessus scanner and tempt them to do scans of systems. Sort of like “playing chicken.” It all depends. Sometimes things break accidentally and it makes the interview process much more exciting for all of us. Or I may not even know how to do the task myself, or haven’t done it recently. It is all good. One of the more interesting interviews was when an RPM database got corrupted when YUM crashed because I hadn’t allocated enough memory for the VM. Happy times as the candidate scrambled to fix the corruption so he could install the package he needed.

Usually, I can determine if a candidate’s technical skills are up to par (relative to their salary expectations and resume, of course) in 10-15 minutes. If somebody has supposedly been doing Linux for over a decade and they don’t know how to restart a service, we have a big problem.

In the new year, I’ve taken this to a new extreme. I don’t even want to bring somebody onsite (even if they are local!) until I’ve seen them at the keyboard. I don’t ever start the standard interview questions until I’ve seen them perform. It just isn’t worth anybody’s time and I’ve also had too many folks that “sound good on the phone” fall flat on their faces during the first 15 minutes.

A Simple Hands-On Linux Phone Screen

Last year I ran across this blog on building a DevOps team where the author described how a Sony hiring manager uses EC2 instances to have candidates do something simple like install WordPress or some logfile analysis. I personally think this is overkill for a screen, so I do something a little less formal, but in the same spirit.

So the first test for a candidate is whether they can send me a properly formatted SSH key that I can put on a Rackspace/Joyent Cloud server or an EC2 instance. I tell the recruiter that the candidate needs to generate an SSH key to log in to the cloud server within 24 hours of the scheduled interview time. Will they send me a PuTTY key or a real SSH key? I’ll cut a junior candidate some slack if they send me a PuTTY key that has not properly been converted.

This also tells me whether they use Windows on their laptop, which is a “tell.”  Not a dealbreaker, but definitely something to consider if you are interviewing for an Open Source shop. Seriously, if you aren’t running Linux or OSX on the desktop?

This also starts the email communication with the candidate, which is another test. Since I’m typically working through a recruiter, I’m often insulated from the candidates during the initial communication, so this is my first chance to see how well we interact. Or not.
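For what it is worth, the "PuTTY key or real SSH key" check mentioned above is easy to automate. The sketch below (Python 3, standard library only) sorts a pasted key into a few buckets; classify_key and its labels are hypothetical names, and the sample key is a fake blob built on the spot, not a usable key. It relies on three well-known markers: PuTTY private key files start with "PuTTY-User-Key-File-", the SSH2 public key format PuTTYgen saves starts with "---- BEGIN SSH2 PUBLIC KEY ----", and an OpenSSH public key is a single line whose base64 blob begins with the key type.

```python
import base64
import binascii
import struct

def classify_key(text):
    """Roughly identify what kind of key material a candidate sent."""
    text = text.strip()
    if text.startswith("PuTTY-User-Key-File-"):
        return "putty-private"    # a .ppk file: wrong format, and a private key
    if text.startswith("---- BEGIN SSH2 PUBLIC KEY ----"):
        return "ssh2-public"      # PuTTYgen export; needs converting for authorized_keys
    parts = text.split()
    if "\n" in text or len(parts) < 2:
        return "unknown"
    key_type, blob = parts[0], parts[1]
    try:
        raw = base64.b64decode(blob, validate=True)
        (name_len,) = struct.unpack(">I", raw[:4])
        embedded = raw[4:4 + name_len]
    except (binascii.Error, ValueError, struct.error):
        return "unknown"
    # The blob starts with a length-prefixed copy of the key type
    if embedded == key_type.encode("utf-8"):
        return "openssh-public"
    return "unknown"

# Fake blob with the right framing, just to exercise the function
fake_blob = base64.b64encode(struct.pack(">I", 7) + b"ssh-rsa" + b"\x00" * 8)
sample = "ssh-rsa " + fake_blob.decode("ascii") + " candidate@laptop"
print(classify_key(sample))
print(classify_key("PuTTY-User-Key-File-2: ssh-rsa"))
```

This only checks framing, not whether the key is valid or strong, which is enough to tell at a glance which candidates need a nudge toward converting their key.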

You can tell a lot about a potential candidate’s professionalism and attention to detail in these initial innocuous exchanges. Depending on their resumes (and my inclination) I may ask them if they prefer a Debian or RHEL based distro. It all depends. I send virtually no instructions apart from the IP address/hostname to connect to and the command to connect to a screen session. This is intentional.

Initially, this was just hiring manager laziness, but now I think it is the right way to proceed. During the phone call everything is verbal. If they can’t understand my instructions and follow-up questions or I can’t understand their answers, we have a problem.

We might as well find that out sooner rather than later, instead of during an outage or an upgrade gone wrong.

I did two of these interviews this past week and they take maybe 30 minutes of my time to set up and confirm. They add a little stress to the interview for both the interviewer and the interviewee, but they are worth it. One candidate was dialing in from a Panera, which was also interesting and provided “additional realism.”

These hands-on screens take a little longer than the “classic” 20-30 minute screening interview, but they are worth it given the time and expense of missing a critical flaw in a superficial phone screen and the amount of time an interview sucks out of HR and the hiring team. Furthermore, every single interaction with a candidate allows the hiring team to identify red flags–or allows the candidate to excel.

The other reason I’m excited about these (and will continue them) is that they allow me to give more challenging questions (like coding/scripting assignments) during the onsite instead of an initial UNIX assessment, which I look forward to doing.

The key lesson for me has been that you don’t need to come up with elaborate/complex tasks as long as you are able to come up with good questions on the fly about why the candidate does (or doesn’t do) something. It also helps if you have more than one interviewer because it allows you to take a break, besides giving the candidate greater exposure to the team.

So for your next sysadmin or DevOps interview, stop procrastinating, and just do it!

My Tech Learning in 2011: Up, Down, or Sideways?

Looking Back on 2011

For me, technical blogging is about jotting down (mostly so I remember) the things I have worked on (or somehow managed to learn) so I can refer to them later. But it is also a way to keep score. What knowledge did I gain? What skills did I maintain? What skills did I leave behind or let grow stale? It is a way to document a living knowledge portfolio. Looking over what I wrote a year ago, I expressed concerns about “getting soft” technically. I knew this was going to be a challenge in the new year, and it did, indeed, turn out to be the case.

This was one of the reasons I made a job change mid-year to get back to a more hands-on operational role, away from leading a team of consultants that focused on Energy Cyber Security, where too much of my time was spent on the proposal churn necessary to keep a consulting practice running on all cylinders. Not that there is anything wrong with writing proposals and pricing work. In fact, one of the most rewarding professional experiences I had at SAIC was delivering the “orals” as part of the proposal team in downtown DC for a large Federal agency for a multi-million dollar contract. I worked with a first-class proposal team and we nailed the presentation. I never did find out if they won the contract, but there is also nothing like the satisfaction of having priced a solution, won the bid, and delivered it to a satisfied customer. But enough background, what did I learn in 2011?

Compliance & Audit

Much of January was spent diving headfirst into the NEI 08-09 standard (basically the Nuclear industry’s version of the NIST 800-53 control catalog) in an attempt to help out some of the engineers/analysts on my team that were onsite at a number of Nuclear plants around the country. Frankly, I did not enjoy this Nuclear compliance work for a number of reasons and not just because it was compliance work, but fortunately I was able to escape this work by March, although it did give me some frame of reference for the earthquake that hit Japan.

For another project, I continued to work on justifying and documenting compliance with NERC CIP 5 and CIP 7 based on a technical solution I had architected, and that would end up being deployed a few months later. The last compliance/audit work I did over the summer was a baseline assessment of NIST 800-53 controls. I’d seen other folks do this on projects, but had never done it myself. I was pleasantly surprised that this process was actually useful in developing a security roadmap for an organization. Even outside the Federal government, there are worse places to start than the NIST 800 series, especially if you don’t take all the requirements literally–meaning interpreting the “spirit” of the guideline if not the “letter of the law.” You can’t let compliance get in the way of your security objectives.

Welcome to Agile

In May and June, I read most of Clinton Keith’s Agile Game Development in preparation for my new role in the gaming industry. While “Agile” is certainly not a panacea, I’m generally positive about participating in standups, grouping work into stories and sprints, stuffing things into the backlog, and living in JIRA. Most important is the transparency and improved communication that it fosters across engineering teams. Agile security? That is still a work in progress; maybe I’ll have more to write on that in the coming year.

Virtualization

For anybody in IT, it is probably no surprise that every single project I’ve been involved in this year has been touched by virtualization. I continued to use VMware ESXi during the first part of the year, but when switching to a pure Linux environment I began to explore more interesting platforms such as OpenVZ and LXC. I also finally started to feel comfortable with KVM, which I can finally say is stable enough to use on Ubuntu 11.04+, although anyone who has tried to run FreeBSD on it knows performance is terrible. But my platform of choice still remains OpenVZ on CentOS 6, although I’m running quite a bit of LXC at home for my wife and kids. Each of my kids has their own Squid instance running on LXC and it just works. And on horrible ancient hardware.

Ops Tools

This year, I also dove headfirst into the land of Puppet, which I most certainly have a love/hate relationship with, which is why over the Thanksgiving break I spent so much time with SaltStack and hope to return to it at some point. Cool stuff with a lot of potential. I was hoping to give Chef a try but I only got as far as playing around with Chef Solo. Selecting (or migrating to) a new configuration management tool is painful, so choose wisely. Monitoring and trending: you can’t forget that. The last time I was in Ops, I had used Cacti and Cricket (yes that dates me) but in my new role I learned Munin, Nagios, Zenoss, and also have collectd running here and there. I wish I’d had time to try out Graphite. Perhaps in 2012, although I doubt it. I briefly played around with Orchestra, but one of my sysadmins ended up doing all of the real work once I figured out you had to install the PXE firmware to get it working with KVM.

The Cloud

I actually experimented with most of the major IaaS platforms this year, which I’m pleased about. Even Azure, of all things. It is amazing what is at your fingertips if you are willing to pass over your credit card and spend a few bucks. Next year I’d like to spend more time on the PaaS front and I’ve been wanting to get CloudFoundry up and running. I learned how to automate EC2 instances using Python Boto and Rackspace VMs using Python CloudServers and started trying to get Joyent’s Node-based API working but the clock ran out, although I did squeeze some time in with PaperTrail and would have gotten farther if it weren’t for blasted rsyslog. Once you’ve used syslog-ng, it is hard to stomach anything else.

Coding

Despite my quest to find something useful to do with JavaScript, 2011 was a mostly Python year. In the Spring, I managed to spend a fair amount of time using Python and MongoDB reusing some code that I’d started on back in 2009 and early 2010 and in mid-year I wrote some APIs that wrapped vzctl, which was the most sustained coding project that ran on anything important since 2006. Besides fooling around with Cloud API’s the only real serious Python work I’ve done since then is with Fabric, which I managed to get other folks to start using (and evangelizing). FTW, as they say! And yeah, I unfortunately had to write some shell scripts in 2011. Lastly, this year I  learned that I hate Perforce (relative to my experience with git or subversion) and that is unlikely to change.

Hardware

If you work in even a small datacenter (or lab) you have to deal with hardware, whether you like it or not, and I generally do not. Disks (and their RAID controllers) and power supplies fail, and that is annoying. Dealing with vendor support is worse, even if they are decent. I learned far too much about Dell and HP rack and server hardware and spent way too many hours pricing various combinations of equipment and worrying about the impact of the Thai flooding on SAS drives and lead times. Unfortunately this will continue in 2012, from what I hear. On the bright side, 2011 was also the year I moved back to primarily using a Mac laptop (the first time since 2007) and I have to admit that I love my 13″ MacBook Pro and am getting used to Lion–as well as an iPhone. At home, I bought a new desktop PC for the first time in 2-3 years and ended up going with an AMD-based Dell XPS 7100, which I mostly use for World of Tanks, but my son plays more interesting games like Skyrim on it. Last but not least, I bought a Nook Color which I rooted (running CM7 on it right now), which was a fun experience, almost bricking it several times. In hindsight, I should not have bought this. I don’t care much for tablets, but it kept me from buying an iPad (or the new cheap Kindles which I was lusting over tonight) and it has given my kids a platform to play Angry Birds on, so it is not a total waste of money.

Operating Systems

Although I first used CentOS back in 2007 (and had to use CentOS 4 and 5 while at Tenable), this was the year I grudgingly began to respect it, or at least when CentOS 6 finally came out and they got their act together releasing 6.1 and 6.2. Much of this respect is because it is a first-class platform for OpenVZ and MySQL, not because I like the way RHEL does things. Ubuntu 11.04 Server is still my favorite server OS if I had to deploy something, although I’m excited about the upcoming 12.04 release. Built Debian packages again (for the first time since 2005) and learned to love FPM. I also found out that FreeBSD 8.2 hasn’t changed much since FreeBSD 6.2. Felt the same. Even built a custom kernel. It felt good to be home. And I learned that building RPM .spec files was relatively painful compared to .deb’s, but that is not surprising because most things are more painful on RHEL derivatives than Debian derivatives.

Network Security

Lastly, I actually still did some network security work this year! In the beginning of the year, I architected a solution around Tenable Security Center again (4.x this time, and it was good to see how the product has come along) and continued with ScreenOS 6.x on low-end Juniper SSGs for some deployments that started in 2010. I used Snort again for a project, actually compiling it from source and building RPMs for RHEL5. Imagine that. I learned to appreciate Bro some more and have seen the improvements, and also used Nessus 4.2 again and even wrote some .audit files to help identify non-puppetized systems. Nessus has also come a long way since the late 1990s when I first used it. I continued to use PF (first on OpenBSD, then migrating back to FreeBSD on my home systems) and once again built some gigabit firewall pairs with FreeBSD 8.2. Top bandwidth 940 Mbps. Not bad for Core i3 on commodity hardware.

Looking forward in 2012

I won’t make the mistake of trying to come up with specific projects I want to accomplish like I did in 2011. I actually opened up some issues on my Google Code page, which I never made any progress on.  Writing an NSE file in Lua. Silly me, although I was amused to see ZMQ on there, given that I ran across it again with SaltStack. I also won’t make the mistake of saying I want to tweet less and blog more because it probably won’t happen.

But damnit, this year, I will learn enough JavaScript to be dangerous and I’m getting started now!

The Sweet Simplicity of Spinning up a Box on the Rackspace Cloud (in Python)

Previously I blogged about how you can use Python Boto to programmatically spin up a Linux box on EC2. Here are the [more straightforward] steps to doing something comparable with Python Cloudservers.

Get the API

I was lazy, so I pulled it from the repos on Oneiric.

mfranz@mfranz-oneric32:~$ dpkg -l | grep cloud
ii python-rackspace-cloudservers 1.0~a5-0ubuntu2 client library for Rackspace's Cloud Servers API

Connect to Rackspace

You should have your API key from the web UI.

I did this all from within bpython without even looking at API docs. It is that easy.

>>> import cloudservers
>>> cs = cloudservers.CloudServers("username","apikey")


Get your Images and Flavors

For the sake of simplicity I just picked the first image (Ubuntu LTS) and the first flavor (only 256 MB of RAM).

>>> i = cs.images.list()[0]
>>> f = cs.flavors.list()[0]
>>> i
<Image: Ubuntu 10.04 LTS>
>>> f
<Flavor: 256 server>

But there are a number of images you could choose from:

>>> cs.images.list()

[<Image: Ubuntu 10.04 LTS>,
 <Image: Windows Server 2008 R2 x64 - SQL Web>,
 <Image: Windows Server 2008 R2 x64 - MSSQL2K8R2>,
 <Image: Windows Server 2008 SP2 x86>,
 <Image: openSUSE 12>,
 <Image: Windows Server 2008 SP2 x64>,
 <Image: Red Hat Enterprise Linux 5.5>,
 <Image: Windows Server 2008 SP2 x64 - MSSQL2K8R2>,
 <Image: Red Hat Enterprise Linux 6>,
 <Image: Ubuntu 11.10>,
 <Image: Fedora 15>,
 <Image: Gentoo 10.1>,
 <Image: Arch 2010.05>,
 <Image: Windows Server 2008 SP2 x86 - MSSQL2K8R2>,
 <Image: CentOS 5.6>,
 <Image: Ubuntu 11.04>,
 <Image: Debian 5 (Lenny)>,
 <Image: Debian 6 (Squeeze)>,
 <Image: CentOS 6.0>,
 <Image: Windows Server 2008 R2 x64>,
 <Image: Fedora 14>]
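Grabbing list()[0] only works as long as the image you want happens to come first. Here is a minimal sketch of picking an entry by name instead. It assumes the image objects expose a name attribute (the repr above suggests they do); pick_by_name and the stand-in Image tuples are my own, not part of the cloudservers library.

```python
from collections import namedtuple


def pick_by_name(items, text):
    """Return the first item whose .name contains text (case-insensitive)."""
    wanted = text.lower()
    for item in items:
        if wanted in item.name.lower():
            return item
    raise LookupError("nothing matching %r" % text)


# Stand-in objects so the sketch runs without a Rackspace account.
Image = namedtuple("Image", "id name")
images = [
    Image(1, "Windows Server 2008 R2 x64"),
    Image(2, "Ubuntu 10.04 LTS"),
    Image(3, "Ubuntu 11.10"),
]

print(pick_by_name(images, "ubuntu 10.04").name)
```

Against the real API this would presumably be i = pick_by_name(cs.images.list(), "Ubuntu 10.04"), and the same helper should work for flavors if they carry a name too.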

Create Your Server

I named this “first” and passed in the image and flavor I defined above.

>>> s = cs.servers.create("first",i,f)
>>> s.status
u'BUILD'
>>> s.progress
0

I haven’t figured out how to have it refresh the status apart from logging in again, but it will eventually update to ACTIVE and 100.
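One way around the stale status is to re-fetch the server object in a loop until it reports ACTIVE. This is only a sketch: wait_for_status is my own helper, and the idea that cs.servers.get(s.id) returns a fresh copy of the server is an assumption I have not checked against the API docs. The fake server below is there so the loop can be run without an account.

```python
import time


def wait_for_status(fetch, wanted="ACTIVE", interval=10, timeout=600,
                    sleep=time.sleep):
    """Call fetch() until it returns an object whose .status equals wanted."""
    waited = 0
    while True:
        server = fetch()
        if server.status == wanted:
            return server
        if waited >= timeout:
            raise RuntimeError("still %s after %d seconds"
                               % (server.status, waited))
        sleep(interval)
        waited += interval


class FakeServer(object):
    """Stand-in for a server object; only carries a status."""
    def __init__(self, status):
        self.status = status


# Simulate two BUILD responses followed by ACTIVE, without really sleeping.
states = iter(["BUILD", "BUILD", "ACTIVE"])
s = wait_for_status(lambda: FakeServer(next(states)),
                    sleep=lambda seconds: None)
print(s.status)
```

With the real client the call would look something like s = wait_for_status(lambda: cs.servers.get(s.id)), again assuming that lookup exists and behaves as described.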

In order to log in, you obviously need to find out your IP address and password. Note that it uses your hostname as a prefix for your password:

>>> s.adminPass

u'firstXXXXX'
>>> s.public_ip

u'50.56.217.100'

Once you are done you can do an s.delete() to destroy the VM. Pretty simple, eh?

Some Interesting (Security) Differences Between EC2 and the Rackspace Cloud

A few things I immediately noticed about Rackspace. First, they generate a root password for you, unlike EC2, where you generate a .pem key that you use to log in to your AMIs. Also, there are no security groups. Lastly, from a network isolation standpoint, you see the broadcast and multicast traffic from other VMs. I was shocked to see NetBIOS UDP traffic and some interesting Blackberry Enterprise Server multicast the first time I fired up tcpdump, in addition to HSRP.

The Almost Perfect On-Call Laptop Kit

Anyone who has done on-call work before knows it sucks to be chained to a computer at home to be close to a (hopefully) reliable Internet connection. It also is no fun to lug a largish laptop along to church (or out to dinner) or wherever you want to be while trying to have a normal weekend or evening.

So with that in mind I spec’d out a couple of laptops for my team with the following requirements:

  • Built-in 3G card (having to deal with a USB 3G card just won’t cut it)
  • Less than 13″ screen size with decent enough resolution so you can RDP.
  • Less than 3 pounds. Weight matters.

So given I work for a Dell/Mac shop (my main work laptop is a 13″ MacBook Pro) I came up with two options: a Dell Latitude e6220 and a Latitude 2120 NetBook. Corporate IT sort of balked at the NetBook idea. I didn’t argue too much, and I’m actually glad they didn’t want to support NetBooks.

(If 12″ MacBook Airs had built-in 3G that would have been a possibility, but having Windows is a good thing given how Dell/HP and other embedded server/management apps sometimes don’t work so well on Mac or Linux. Been there. Done that.)

So, for comparison, you can see the Dell sandwiched in between my ThinkPad X60s and 13″ MBP. It is slightly thinner than the Mac and weighs about as much as the ThinkPad, but is slightly wider. It actually fits fairly well in 13″ cases or sleeves; as you can see, it is slightly smaller than the Booq Vyper case that I use on my MBP.

Notice the largish power supply, which is one of my complaints. I also hate how Dell AC adapters are three-prong grounded, whereas Lenovo’s are two-prong. We live in an old house and our main floor outlets aren’t grounded.

My other main complaint is that it only has a touchpad. It has a Core i5 with 4GB, so it is decent enough for running a VM or two under VirtualBox. I got the smallest battery so battery life isn’t great (2-3 hours) but it is good enough.

So my next decision was what sort of case. I wanted something small so that you could put it in a backpack, yet something that could go standalone. I ended up finding a 13″ Timbuk2 Quickie bag at the local Best Buy that was exactly what I was looking for, at $29 on sale. It could use a little more padding, but it is decent enough.

A First Taste of Salt Stack States

What is Salt?

Although Salt Stack is, first and foremost, a super-cool new ZeroMQ-slinging command execution framework written in Python that makes Fabric seem weak and pathetic (sorry guys, I do still love you, but you should be jealous), one of the features I’m most interested in is states.

This is why folks use Puppet and other CM tools. Instead of defining how you do something on a set of servers (perhaps through coding up a nasty shell script) you define what the end state should look like: a given package should always exist on a certain system, a file should have a set of permissions, a user should be created, etc.

Creating your file repository

By default, Salt Stack uses /srv/salt as your file repository, so I created the following:

The first file you should create is top.sls, which is roughly comparable to nodes.pp in Puppet.

Instead of having to create a class that is inherited by all nodes, you just create an entry with ‘*’, which matches all nodes, and list the two state files to apply to them: resolver and scanner. The pfmon.sls will only apply to hosts that have “bsd” in their hostname.
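To make the targeting concrete, here is a rough Python model of that top file. It is a reconstruction from the description above, not the actual file: the “base” environment key, the exact glob used for the BSD hosts, and the two hostnames are my assumptions. It uses shell-style glob matching from the standard library, which I believe approximates how Salt matches targets like ‘*’ against minion names.

```python
from fnmatch import fnmatch

# Hypothetical reconstruction of the top file described above, as Python data.
TOP = {
    "base": {
        "*": ["resolver", "scanner"],   # applies to every minion
        "*bsd*": ["pfmon"],             # only hosts with "bsd" in the name
    }
}


def states_for(minion_id, top=TOP, env="base"):
    """Collect the states whose glob target matches the minion id."""
    matched = []
    for target in sorted(top[env]):
        if fnmatch(minion_id, target):
            matched.extend(top[env][target])
    return matched


# Made-up hostnames, one BSD box and one Linux box.
print(states_for("freebsd82"))
print(states_for("ubuntu1104"))
```

The BSD host picks up all three states, while the Linux host only gets resolver and scanner.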

Just like it sounds, resolver.sls manages your resolv.conf.

The other state file is a little more interesting. If you’ve used puppet, you should be familiar with the concept here, but the templating language is Jinja. Because tcpdump is part of the base OS on FreeBSD, I didn’t include it on the FreeBSD node. Nmap is the name of the package on all OSes so I left that, but xprobe has a different package name depending on the OS.

But there are some differences. Salt doesn’t require you to provide a template suffix (like .erb), and your state files (meaning the .sls) can use the template language as well as your configuration files. And, at least in this first example, they are all in the same directory.

Applying Your States

But if you are used to puppet, there are a couple of things you’ll find unusual about Salt. First, when you connect your minion (client) to the master (server) nothing happens until you tell it to.

It does not automatically do a “run” so you need to tell it to:

What I did here was a ping of all the minions by running the salt client (on the master) and sending the “test.ping” command to the minions, which then report back. Then I sent the “state.highstate” command, which causes the minions to compare their local state to what is stored on the master, comparable to when you tell your puppet client to check in to the puppetmaster. So this is pretty boring, because I had run it previously and everything is already installed. NOTE: this is on the master.

And something else odd (that at first I did not understand) was that the results reported by running the salt command were different on each run. Sometimes just one host would come back. What was happening was that the changes were occurring on the minions; they just weren’t being reported back, because Salt uses a non-blocking publish-subscribe communication mechanism. So I had to increase the timeout (with the -t option) to ensure the client waited long enough to get the results from the minions, but the actions were always being performed on them.
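For reference, the two commands and the longer timeout can be put together like this. It is a small sketch that only builds the argument lists; salt_argv is my own helper and the 60-second value is an arbitrary example, not a recommendation.

```python
def salt_argv(target, function, timeout=None):
    """Build the argument list for the salt command run on the master."""
    argv = ["salt"]
    if timeout is not None:
        # -t is the timeout option mentioned above, in seconds.
        argv += ["-t", str(timeout)]
    argv += [target, function]
    return argv


ping = salt_argv("*", "test.ping")
highstate = salt_argv("*", "state.highstate", timeout=60)

print(ping)
print(highstate)
```

Handing a list like this to subprocess.call() on the master keeps the shell from expanding the ‘*’ before salt ever sees it.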

Another major architectural difference is that Salt Stack uses a persistent TCP connection (or two) between the master and the minions vs. checking in via HTTPS the way that puppet does.

Apart from tailing the logs (/var/log/salt/master and /var/log/salt/minion) if you run into trouble, you can also perform the operation locally on the minion (with salt-call). If your states are failing, I recommend this. This is roughly comparable to a local puppet run or to using chef-solo.

Where to go learn more?

Although the documentation is pretty good for such a young project, there are still some gaps, but there is already a small but vibrant community available on IRC and Google Groups. There is also a great (and quite long and in-depth) webcast with the developer.

Optimizing OpenVZ Templates for Fast Container Creation

Like most defaults, the precreated images available for Ubuntu and CentOS are suboptimal, especially when you are starting up large numbers of them. This is particularly true if you are passing commands into /etc/rc.local (perhaps to install packages or kick off a deployment from a custom APT/YUM repo) since rc.local is the last thing to run.

1. Remove or disable unnecessary packages such as apache, samba, sendmail

On CentOS, disable sendmail with chkconfig, because if you try to remove the sendmail RPM you will remove cron, which is a bad thing. If you don’t, your boot process will take 4-5 minutes, your commands in /etc/rc.local will appear to never execute, and you’ll see something like…

tcpdump: WARNING: venet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on venet0, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
06:13:10.996733 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:16.001843 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:21.006956 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:26.012069 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:31.017249 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:36.022374 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:41.027486 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:46.032598 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:51.037759 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:13:56.042859 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:14:01.047944 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:14:06.053065 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)

2. Disable IPv6 if you don’t need it

Within /etc/modprobe.d/blacklist.conf

blacklist ipv6

And within /etc/sysctl.conf

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

The AAAA lookups are a killer, easily adding 30-45 seconds when bringing up networking. Remember, every performance problem is a DNS problem. If only we didn’t have to use name resolution, the world would be a better place.
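To put a number on the stall, here is a short sketch that reads the tcpdump capture from step 1 and adds up how long the resolver spent retrying each query type. The capture text is copied from above; the parsing assumes exactly that tcpdump line layout (timestamp first, query type in the seventh field), and summarize is my own helper.

```python
from datetime import datetime

CAPTURE = """\
06:13:10.996733 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:16.001843 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:21.006956 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:26.012069 IP 192.168.10.127.54066 > 192.168.1.1.domain: 6943+ AAAA? ve127. (23)
06:13:31.017249 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:36.022374 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:41.027486 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:46.032598 IP 192.168.10.127.43303 > 192.168.1.1.domain: 2945+ A? ve127. (23)
06:13:51.037759 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:13:56.042859 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:14:01.047944 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
06:14:06.053065 IP 192.168.10.127.43358 > 192.168.1.1.domain: 37637+ MX? ve127. (23)
"""


def summarize(capture):
    """Return {query type: [attempts, first timestamp, last timestamp]}."""
    summary = {}
    for line in capture.strip().splitlines():
        fields = line.split()
        stamp = datetime.strptime(fields[0], "%H:%M:%S.%f")
        qtype = fields[6].rstrip("?")
        entry = summary.setdefault(qtype, [0, stamp, stamp])
        entry[0] += 1
        entry[2] = stamp
    return summary


summary = summarize(CAPTURE)
for qtype, (attempts, first, last) in sorted(summary.items()):
    print("%-4s %d attempts over %.1f s"
          % (qtype, attempts, (last - first).total_seconds()))

start = min(entry[1] for entry in summary.values())
end = max(entry[2] for entry in summary.values())
total = (end - start).total_seconds()
print("first to last query: %.1f s" % total)
```

In this capture each query type is retried four times at five-second intervals, so the unanswered lookups for ve127 account for roughly 55 seconds between the first and last packet.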

3. Remove pre-generated SSH keys. 

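Removing these will increase your boot time, since each container then has to generate its own keys, but the Ubuntu containers do contain pre-generated SSH keys for the host, which is a bad thing: every container created from the template ends up with the same host keys.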

4. Rebuild your container templates with faster (but larger) gzip compression.

One of the sysadmins on my team came up with this one. I haven’t timed this, but when creating containers, there is quite a bit of IO as the containers are built from the templates. This is the key bottleneck.
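Here is a minimal sketch of what step 4 could look like: rewriting an existing .tar.gz at gzip level 1 (fastest, largest) using only the standard library. The file names are placeholders and the demo works on a throwaway file in a temp directory rather than a real template. Whether this actually speeds up container creation is untested, as noted above.

```python
import gzip
import os
import shutil
import tempfile


def recompress(src, dst, level=1):
    """Rewrite a gzip file at a different compression level (1 = fastest)."""
    with gzip.open(src, "rb") as fin:
        with gzip.open(dst, "wb", compresslevel=level) as fout:
            shutil.copyfileobj(fin, fout)


# Demo on a throwaway file standing in for a template tarball.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "template.tar.gz")
dst = os.path.join(workdir, "template-fast.tar.gz")
payload = b"openvz template contents\n" * 5000

with gzip.open(src, "wb", compresslevel=9) as f:
    f.write(payload)

recompress(src, dst, level=1)

with gzip.open(dst, "rb") as f:
    restored = f.read()

print("level 9: %d bytes, level 1: %d bytes"
      % (os.path.getsize(src), os.path.getsize(dst)))
shutil.rmtree(workdir)
```

The contents are unchanged; only the compression level of the archive differs, so the rewritten file should be usable wherever the original was.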

 

Walking Debian Package Dependencies with Python-Apt

So something that I’ve needed to figure out over the last few days is how to get the complete list of dependent packages for a given installed package.

I figured there had to be a Python library for this and, using this blog as a starting point, I wrote up this script.

mfranz@opti620u1104:~$ ./deplist.py tcpdump
Required packages for tcpdump
['libc6', 'libc-bin', 'libgcc1', 'multiarch-support', 'gcc-4.5-base', 'tzdata', 'debconf', 'perl-base', 'dpkg', 'libbz2-1.0', 'libselinux1', 'zlib1g', 'coreutils', 'libacl1', 'libattr1', 'xz-utils', 'liblzma2', 'debconf-i18n', 'liblocale-gettext-perl', 'libtext-iconv-perl', 'libtext-wrapi18n-perl', 'libtext-charwidth-perl', 'libpcap0.8', 'libssl0.9.8']

I’m assuming there is not a non-recursive solution to this? And how do I get .py to render on WordPress so I don’t have to do screen shots?
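On the non-recursive question: a queue-based walk does the job without recursion. The sketch below is not the deplist.py script from the screenshot; walk_dependencies and apt_direct_deps are my own names, and the python-apt part assumes the cache and version objects expose installed, dependencies, or_dependencies and name the way I remember them. The small dependency table is made up for the demo (real package names, invented edges) so the walk can be run without apt.

```python
from collections import deque


def walk_dependencies(name, direct_deps):
    """Breadth-first, non-recursive walk of everything reachable from name."""
    found = []
    visited = set([name])
    queue = deque([name])
    while queue:
        current = queue.popleft()
        for dep in direct_deps(current):
            if dep not in visited:      # also protects against cycles
                visited.add(dep)
                found.append(dep)
                queue.append(dep)
    return found


def apt_direct_deps(cache):
    """Build a direct_deps function backed by a python-apt cache (assumed API)."""
    def direct_deps(name):
        if name not in cache or cache[name].installed is None:
            return []
        # Simplification: take the first alternative of each "a | b" group.
        return [dep.or_dependencies[0].name
                for dep in cache[name].installed.dependencies]
    return direct_deps


# Illustrative edges only, not real dependency data.
FAKE = {
    "tcpdump": ["libc6", "libpcap0.8", "libssl0.9.8"],
    "libc6": ["libc-bin", "libgcc1"],
    "libgcc1": ["gcc-4.5-base", "libc6"],   # cycle back to libc6
    "libpcap0.8": ["libc6"],
}

result = walk_dependencies("tcpdump", lambda n: FAKE.get(n, []))
print(result)
```

On a Debian or Ubuntu box with python-apt installed, the real thing would presumably be walk_dependencies("tcpdump", apt_direct_deps(apt.Cache())) after an import apt.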
