Why I'm Moving Away from the Play Framework

I've been using the Play framework since I started at RtR 3 months ago. Last week, I made the decision that no new services will be written in Play from that point forward. It started out as a great little framework that was pretty quick to learn and easy to use, but it's turned into something that I would not recommend anyone use for serious production applications. What happened?

First, I lost faith in the developers.

One of the first things that annoyed me about Play was the inability to run a single test from within a Play test class inside your IDE. I suppose the thinking was that you will always run the Play test app, or something, but I prefer to leave my IDE as little as possible when I'm working, and while running an entire test class from the IDE worked fine, running a single test did not. So, being the good little open source programmer that I am, instead of bitching I rolled up my sleeves and fixed the bug. It was a pretty trivial fix. I even wrote a test case. Then I put in a pull request and waited.
After submitting the pull request, I commented on the pull request, commented on the ticket, and finally sent an email to the mailing list. And the response I got was basically that the team is too busy working on the next generation of the product to absorb fixes for the older generation. Having worked on open source projects myself, I understand what it's like to have limited bandwidth to look at changes. But if the project team's bandwidth is so limited that they can't even afford to look at small fixes like this, it seems like the project is basically abandoned. At that point I lost faith that I could rely on the community to support the 1.X branch of this product. Not necessarily a dealbreaker, but definitely a bad sign. 

Then, I lost faith in the platform.

We started to hit some serious bugs in the platform during a big push on a complex service. First, our developers that used Mac and Windows hit a bug similar to this, where they simply couldn't get the app to work no matter what they did. It worked fine in Linux, but even a clean checkout would fail to run for them. It was inexplicable, irritating, and we lost a couple of days of development work trying to get around it (rolling back checkins, pulling out modules, poring through stack traces). By this point, I had lost faith in the community, so I didn't see the point in going to them for help. Fortunately, we did finally get around it (it seemed to be a bug in the CRUD module), but we were all really frustrated and annoyed with the framework after that experience.

Finally, I lost faith in my own ability to debug the framework.

The issues above were enough for me to want to move off of Play for new projects. The thing that caused me to move off of Play for projects that are already in development (but not in production) was this: At some point, we had written a migration job in Play for a major data migration. We discovered the strangest thing would happen. The job would run across several job threads, and at some point, one of the threads would hang. But it would not hang in a way that I have ever seen a JVM thread hang. The thread was in RUNNABLE state, and it was in a HashMap method (either get or put) and it was just sitting there. Not doing anything. No locks, no database or other IO, plenty of memory, plenty of resources, just sitting in that HashMap.get method, hanging out.
Now, maybe you've seen that before (and if you have, please leave a comment!). But I have seen a lot of JVM issues in my day, and this is a new one. There was no reason for this thread to be hung. And yet it was. I can debug just about anything you can throw at me in Java, but I had absolutely nowhere to start looking to debug this issue, except a vague suspicion that it was related to the way the framework was rewriting the classes under the covers. That is a dealbreaker, ladies and gents. I could've probably debugged why the module was causing the app to crap out for my developers, if given enough time. But I cannot say with any certainty that I could debug whatever the hell was causing that thread to hang.

If I felt the developers supporting Play were committed to building a real community of support around the 1.X version, I might have stayed with it longer. It's a giant pain to find something else that is easy, lightweight, supports JPA, and doesn't force me to write XML. But I can't use a product that I know has issues even I can't debug, maintained by a team I don't trust to keep it to the standards my team needs to confidently use it in production.

Three Reasons You Should Be Training Your Successor

It's Friday afternoon of a long week. Despite wanting to pack up your bag and go home, you know you need to review the new hire's code. It's their first week so inevitably it will be some scary stuff for you to untangle. When you get in there, it is scary. But not in the way you expected. Instead of causing more bugs than they fixed, they've actually managed to fix every bug you gave them and in the process redesign a hairy part of the system into something rather... elegant. You want to find fault somewhere, but you can't. They're really good, so good that you're feeling threatened. I want to convince you that this threat is an opportunity. When you find that up-and-comer on your team, instead of trying to prove that you are better than them, train them to replace you. Why? Here are three major reasons:

1) Career Growth 
It's hard to grow yourself and your career within a company if you hoard the knowledge or ownership of a particular project. The people that succeed best at growing into bigger roles are those that can point to a protégé waiting in the wings to fill the position they are leaving behind. This applies to us tech types that may not want to grow into managers, but want the freedom to move into new projects. It's easy to get stuck on the same codebase when you're the only expert in the area. Training other smart developers on the team lets you move on to new and exciting things as the opportunity arises, and with no guilt about leaving people in the lurch.
Even if you prefer to switch companies every few years, it is always great to be able to show a track record of several successful projects per company, and it gives you the opportunity to get a good breadth of experience.

2) Guru Points
Teaching and mentoring others makes us understand what we know even better. I always find that any time I have to give a teaching talk on a topic, I learn that topic a little bit more. Explaining things to others formalizes what you know, and in watching them learn they will almost always teach you something as well. Do you want to grow into the guru of your team or company? Mentor others and teach them all you know, and before long you may find yourself the most trusted advisor on the team.

3) A Stronger Network
Helping others grows your network, and provides you a pool of talent that you can draw on throughout your career. In a big company, knowing people on other teams that trust and respect you makes cross-company collaboration much more effective. If you are the sort to move between small companies every few years, it's likely you will need to hire talent to work with you and make those companies successful, and who better than your former mentees? And if one of those that you mentor grows to become a great success? They are not likely to forget the people that helped to get them where they are, and can be a very valuable part of your professional network.

This post is dedicated to my mentor Mike, who taught me this rule and practically every other useful thing I know. Please leave a comment here or on twitter!

Networking woes in Java

The only major CS subject I never took a class in was networking. It's kind of ridiculous, looking back, that I took as many systems classes as I did but always eschewed networking. I do own a copy of UNIX Network Programming, Volume 1: Networking APIs: Sockets and XTI, bought at some point in the past when I knew I was going to be doing some distributed systems work and figured it would be a useful reference. But I can't say it's been my constant companion. For I have learned one thing in my years of Java systems coding:

Networking code is HARD.

Here's exhibit A: ZooKeeper monitoring misuses sockets. I spent a good chunk of time desperately trying to figure out why my monitoring commands were crapping out halfway through when run from New York to London. Turns out, you can't safely expect to just close half a socket, leave the other half open, push some data to it and then close it while seeing all the data through to the other side. Not without a final handshake indicating the client has gotten all the data. Or at least, I think that's the case. The thing is, this will work well enough over a very fast network connection or with very little data. The guarantees around SO_LINGER etc. change from kernel to kernel, and my reading at the time led me to think that the standard Linux kernel behavior in this case may well have changed over the years that ZooKeeper has been around. So we need to completely rip out and redo the monitoring code if we want to have any hope of this working right for other big, global deployments in the future.
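To make the shape of the problem concrete, here's a minimal sketch of the client side of one of those monitoring commands (the host, port, and timeout are placeholders, not our setup). The client writes a four-letter command, half-closes its side, and reads until the server closes; nothing in the exchange confirms the client actually received everything before the server tears down the connection.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class FourLetterWordClient {
        public static void main(String[] args) throws Exception {
            Socket sock = new Socket();
            try {
                sock.connect(new InetSocketAddress("zk-host.example.com", 2181), 5000);

                // Send the four-letter command and half-close our side.
                OutputStream out = sock.getOutputStream();
                out.write("stat".getBytes("UTF-8"));
                out.flush();
                sock.shutdownOutput();

                // Read until the server closes its side. Over a fast link this
                // "works"; over a slow WAN link a server-side close can cut the
                // response short, since there is no application-level ack.
                InputStream in = sock.getInputStream();
                byte[] buf = new byte[4096];
                int n;
                StringBuilder response = new StringBuilder();
                while ((n = in.read(buf)) != -1) {
                    response.append(new String(buf, 0, n, "UTF-8"));
                }
                System.out.println(response);
            } finally {
                sock.close();
            }
        }
    }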

Exhibit B is my current debugging nightmare. Part of our release last week involved a new backend Play service that itself connects to a different backend Play service to prepare results for our storefront. We noticed, several hours after launch, that the service started to throw exceptions that were ultimately caused by java.io.IOException: Too many open files. I know enough about Java to know that running out of file descriptors is often a Bad Thing. 

So we're leaking sockets. Why? To date, we don't know. The underlying libraries are async-http-client and netty, but there's very little to indicate what is going on.[1] The sockets show up in netstat/lsof as ESTABLISHED TCP connections to the various storefront servers. But the storefront servers do not have most of these sockets open on their end. How are they ESTABLISHED with no partner? It's an ongoing mystery, one that we haven't been able to reproduce on any other machine (the current theory is bad network hardware/software at the lowest levels, but honestly that's just a shot in the dark and one that we can't verify without taking down a production service).

So, while I keep debugging, what are the takeaways here?

1) You shouldn't write your own socket handling code in Java. Really, no. Don't do it. Use Netty. It's very good. Of all the things not to reinvent yourself, I would put networking at the top of the list with a bullet. It's hard, and requires the kind of deep expertise that you can't fake. And, when you fake it, you may end up with something like our ZooKeeper monitoring, that seems to work for years while hiding small but significant bugs.

2) If you're a system architect writing any kind of web services/distributed system architecture, you should know your unix socket monitoring commands. lsof is obtuse but powerful. netstat is simpler and still quite useful. This article has a few others, like ss and iftop. Know how to up the ulimits for your processes in case you find yourself with a slow socket leak that you need time to debug.
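On the Java side, one small trick I like (HotSpot/Unix specific, since it leans on the com.sun.management interface) is having the process log its own descriptor usage, so a slow leak shows up in your logs long before you hit the ulimit:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdUsage {
        // Log the process's open file descriptor count against its limit.
        public static void logFdUsage() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
                System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
                        + " / max: " + unixOs.getMaxFileDescriptorCount());
            }
        }
    }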

Have an idea what my bug is? I'd love to hear it! Leave me a comment or hit me up on twitter!

Edit 2/27: Looks like our bug was indeed on the cloud vendor side; possibly a misconfigured firewall. Moving to a new box and rebuilding the box we were on solved it.

[1] Thank God Play is at least using good networking libraries, because the last time I tested ZooKeeper, when it runs out of sockets the service hard fails with almost no indication of what happened.

The value of physical objects

I led my first successful release at Rent the Runway today. It was a major revamp of our homepage for a segment of our users. My boss was on vacation and the release happened to fall on his travel day, so it was up to me to help guide the team to push it out.

A bit of background. The storefront of our site is a complex, hairy piece of code that currently controls far too much business logic and is very difficult to test and change. We're working on moving logic out of the storefront into a more service-oriented backend framework model, but it's slow going, and we can't just stop all work while we rewrite everything. So we're trying to make these incremental changes that involve many people, integrating in a complex and difficult-to-test front end. And all the Jira tickets and standups in the world didn't seem to be helping us get a handle on what was happening, who was doing what, and where we were getting stuck and missing our deadlines.

Enter the kanban board. We rolled this out Monday with the help of our product management team. There were fears that people wouldn't like it, that it would be an added piece of process that would bog things down further, that the developers wouldn't bother with it. Today proved all those fears wrong. Within the space of 10 hours, we used the board as a quick glance to see who had which bugs assigned, to see how progress was being made, to quickly reassign work, and to triage the remaining tickets to identify the launch-blockers and promote them.

And when we got it all pushed out, the developers took pictures of the board as evidence of all they managed to accomplish in a day.


We live in bits and bytes and screens and dashboards and webpages and text files. But sometimes there's no replacement for a physical thing that you can glance at, touch, move around, circle, and stand in front of with a colleague. Never underestimate that.

Quick Wins: Monitoring Request Times in Play with Coda Metrics

My twitter feed has been abuzz about Coda Hale's Metrics library for a while now. I decided to finally bite the bullet and try it out, and the result was a very nice quick win for our code base.

We're still using Play at work, and we have a service about to go into production that we've been monitoring through the oh-so-elegant method of "writing log messages". This is fine, but it doesn't tell you how long various request types are taking on average without doing a bit of log parsing, and I'm not much of a scripter.

Today, I promised that I would provide something slightly better to measure how our various endpoints are doing. Cue coda. I've been looking at it on and off for a couple of days, but kept getting hung up on wanting to do things like use the EhCache metrics gathering (not trivial in Play at first glance). Going back to basics, I decided after some thinking that the histograms would be the best thing to use. We were already grabbing method execution time for logging purposes, so all I had to do was insert that into a histogram and it would track the running times. Simple enough. But I want to create these histograms for each method type, and ideally, I just want to put it into our superclass controller that is already set up to capture the method timings and log.

Fortunately, Play has lots of nice information floating around in the "request" object of its controllers. Using that object, I can see what Controller subclass this request is destined for, as well as the method that will be called on that class. So I have enough information to create the histogram for each method, like so:

Histogram histo = Metrics.newHistogram(request.controllerClass, request.actionMethod, "requests");
Great. But I was a little tired, and thought that I needed to keep these around, so I stuck them in a ConcurrentHashMap associated with a unique key based on the controller class and the action method. Turns out though, if you look in the MetricsRegistry source code, you'll see that in fact you don't really need to do this at all. As long as the "MetricName" that would be generated for your metric is the same, the same metric will be used for the monitoring. Now THAT is the kind of clever code I like to see.

I decided to keep my ConcurrentHashMap around anyway, to save myself the (utterly trivial) overhead of creating the various objects passed in to the registry by newHistogram. The resulting code is embarrassingly simple. So simple, in fact, I wanted to make it more complicated and it took me 3 revisions to realize how little code I actually needed.
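For flavor, here's roughly the shape of it, minus the ConcurrentHashMap caching (a sketch against Play 1.X and the Metrics 2.x API, not the exact code that's on GitHub):

    import play.mvc.Before;
    import play.mvc.Controller;
    import play.mvc.Finally;

    import com.yammer.metrics.Metrics;
    import com.yammer.metrics.core.Histogram;

    public class BaseController extends Controller {

        private static final ThreadLocal<Long> startTime = new ThreadLocal<Long>();

        @Before
        static void startTimer() {
            startTime.set(System.currentTimeMillis());
        }

        @Finally
        static void recordTime() {
            Long start = startTime.get();
            if (start == null) {
                return;
            }
            // Same controllerClass + actionMethod means the same MetricName,
            // so the registry hands back the same Histogram on every request.
            Histogram histogram = Metrics.newHistogram(
                    request.controllerClass, request.actionMethod, "requests");
            histogram.update(System.currentTimeMillis() - start);
            startTime.remove();
        }
    }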

Here is the resulting BaseController, on GitHub, in a skeleton Play application.

I'm a bit sleep-deprived, so if I missed something, be sure to leave a comment or hit me up on twitter!

Developer Joy, Part 2

I got some good comments on my last post, including a few submissions. One idea that I intentionally left out was that of working on a project you're passionate about. In fact, I think the point is that developer joy has to transcend love for a project. It is the stuff that keeps us going despite working on code with a purpose we find boring. It is the stuff that makes writing dull-but-important software worthwhile. It's the stuff that, in short, every good tech lead should keep an eye on.

Sadly, most developers are not working on problems that they find intrinsically fascinating. I told myself in coming into computing as a career that one reason I was doing it was that it had application in so many areas. And getting to apply computing to the fashion industry is a problem I find rather fascinating (and never in a million years expected to be solving). But I spent many years in industries that bored me to death to get to where I am today, and I still enjoyed myself immensely doing it so long as I had enough elements of developer joy.

As an illustration of this joy, one of the more enjoyable projects I ever worked on was the distribution of the test runner for a code base I worked on. The code base had thousands of tests, most of which were slow integration tests, and we expected this test suite to run and pass both on every checkin, and before developers checked in code. We'd gotten to a point where tests were taking 4 hours and most of a developer's local CPU cycles. We weren't willing to just get rid of tests, and we weren't able to refactor the external dependencies completely to make the individual tests themselves faster (ah, Sybase stored procedures!), so we set about figuring out how to run this build process across a farm of machines. I like distributed computing, but this problem was more about dealing with the distribution system we used, debugging strange transient errors when we pulled formerly-sequential tests apart, and setting up a bunch of infrastructure to get things working.

Why did this project end up being so fulfilling? Well, for one, the team working on the project was incredibly smart. The main developer taught me a ton about the underlying distributed system, and in turn we had some great team sessions of brainstorming. The whole project was around testing, and of course we wrote tests for the project components themselves, so it was a well-developed system. We had plenty of time to actually work on the problem. And the solution we produced was pretty awesome. In short, it was a fun problem to solve not because I love build systems and tools (a common misconception about me, actually), but because I love solving hard problems, solving them well, and solving them with smart people.

Some developers are always going to need projects that they are passionate about. Those developers tend to form startups, go into academia, or find a spot in a big company with the resources to splash on their particular niche. But most of us work on problems that we solve to pay the bills. That doesn't mean there isn't joy in a day's work. When that work involves learning new things, perfecting your own expertise, being able to solve problems and see those solutions delivered, it gives us joy. Tech leads, CTOs, and founders sometimes forget that the people working for them may not have the same love for the project that they do. Keeping up the elements of developer joy helps bridge that inspiration gap.

Developer Joy

I've been thinking a lot about developer joy lately. Specifically, I've been thinking about ways I can increase it for the developers that I work with at Rent the Runway. While walking home today I realized that, while I might have general ideas for things to try to increase that joy, I haven't ever bothered thinking about what it actually IS. So I present a list of things that increase my own developer joy.

1. The opportunity to learn new tools and systems.
If I could have a list of projects worthy of the 3 major NoSQL styles out there (BigTable, Dynamo, Document Store), I would be in hog heaven.

2. Code bases that are easy to check out, compile, run tests on, and deploy.
I love code bases with "eclipse" targets that just make it go.

3. 3+ hour chunks of time devoted to coding.
You know you're getting too close to being a manager when you have to start hard blocking your calendar if you ever want to get anything done.

4. Automated builds that are kept in good shape.
It's a matter of respect for your code and your teammates.

5. Having a team to brainstorm with.
Bouncing ideas off of others almost always hardens them into fighting shape fastest.

6. Source code for all my libraries at my fingertips.
Nothing frustrates me more than being unable to see what the code I'm calling is actually doing.

7. The time to deliver a quality product.
I like to feel pride in my work, and delivering something that works and works well is integral to that pride.

I'm sure I'm missing a ton of things, and my list is probably different from your list. Please tell me what I'm missing! Leave a comment or hit me up on twitter.

Edit 2/1: What did I miss? My favorite submissions include:
Being able to point to work done, even on the backend (@lucasjosh)
The ability to get hardware when you need it! (@abeppu)
Opportunities to showcase work outside of the company via public speaking, blogging, and contributing to open-source. (@dblockdotorg)
And this very eloquent comment from my long-time mentor, Mike, on my G+:
" Remember there's the joy of stylistic expression in code. After the architecture white-boarding, after the design arguments (if there were any), then it's you and your editor (and your 3+ available hours to code) - and the joy of just saying it in code. Saying it well has always been so important to me."

Keep it Simple, Dingus

Earlier this week, I took a few devs from my company over to visit Art.sy. dB had offered to talk front-end technology choices and techniques with us, and it ended up being a very useful overview of lessons that they learned in building out their incredibly beautiful UI.

Much of the content went over my head. As you may be able to tell from the ultra-boilerplate layout of this blog, I'm not much of a front-end developer. But one thing that dB said in passing really hit home. When asked why he had chosen to use (or not use) some technology, he made the point that he wanted to keep the set of different systems he was using very limited to begin with because he thought that the architectural complexity overhead would be bad for a small startup.

I've been thinking a lot about that idea this week as I build out a new back end system to host our product data. We've already decided to go with MongoDB for the storage layer. It's easy to set up, supported by Play modules, and our ops people know how to administer it. Most importantly, it works very well for the type of data that we are sticking into it. Everything about the project has gone smoothly, and so over the past couple of days I've been trying to figure out how to do generic text search over the data. I want to use something that has all the goodies of Lucene built in, and I don't think that the MongoDB text search is going to quite do it (although using Mongo to provide fast filtering has been a dream).

I quickly found myself down a rabbit hole. Solr couldn't immediately parse the BSON that Mongo spit out for our documents (not surprising), and I didn't find any easy translators online. I started looking at what modules existed in Play to enable integration with a search system and came across elasticsearch, then fell further down the rabbit hole trying to get the module to support not only JPA objects but Morphia-generated Mongo objects. In the process I re-read articles and notes on the idea of using Solr as the storage layer with no other backing store, was pointed at SenseiDB, and generally began to feel a sense of despair at the complexity of it all.

I like systems. I like to learn systems, to understand their strengths and weaknesses, to read their code. In the 6 or so weeks I've been on this new job, I've had to resist the urge to push for immediately moving to a Hadoop-based analytics platform, our own distributed file system for image stores, possibly a different NoSQL store for our user data. I realize now that I need to fight that urge even more. Using the perfect tool for every job incurs an administrative overhead and attention thrashing that an infrastructure team of 3 cannot possibly hope to manage well.

So tomorrow I'm going to take a step back and think about how I can simplify my text search problem. Maybe the answer is to just do it in Mongo for now, and save the feature/complexity tradeoff for another day. One thing is certain: right now, it's more important that I become an expert in the systems I have than pick the perfect technology for each tiny problem.

Framework Developers, Application Developers

I was chatting over drinks with a buddy of mine (All Things Hadoop, aka Joe Stein) the other day, and we both agreed that we were annoyed with open source frameworks that seemed like they were built by people that had never written applications using said frameworks, and sometimes by people that seemed to have never developed applications at all. I've been both an application developer and a framework developer, and I can say without question that the worst job I've ever done with a codebase was the case of working on a framework that I never used and didn't originate myself. Why does this happen? I'm a good developer, but I'm not immune to the common pitfalls of framework/library development.

Pitfall 1: Never running a feature in a real application
I think this is a very common problem of frameworks developed by people that aren't actively using them. You think of a cool feature, or maybe some user asks you for one, and you spec it out and implement it. You hopefully write some good unit and integration tests, and everything seems to work. But of course, you neglected to test things like what happens when the whole system is rebooted and the state of this feature changes. Especially with certain kinds of features you can build it half right and have it silently fail for a long time before anyone notices. Quotas in ZooKeeper are an excellent example of this: a monitoring feature that worked until the quota was written to snapshot, and didn't seem to be used by any of the maintainers of the project. (cf this not very descriptive jira)

Pitfall 2: Never having to test application code that uses this framework
I'm hitting this a bit in my usage of the Play framework. It's a framework that did have a lot of testing features built into it, but... they neglected to implement Filterable in their JUnit runner, so you can't run a single test out of a class in your IDE. I submitted a fix for this feature a few weeks ago that has been withering on the vine, despite the fact that this is an incredibly annoying thing to overlook and a trivial thing to fix. The framework also doesn't support changing the http port on the command line when running tests automatically. Why would you need to? Well, maybe you happen to have a code base with several active branches in development that are also being automatically tested, as I do right now. The framework developers may never get bitten by this, but it's definitely an annoyance as an application developer using the framework.

Pitfall 3: Throwing in everything and the kitchen sink
I recently saw a retweet asking why the hell Guava would add an Event Bus feature. Does that really belong in a collections framework? When your whole life is the framework you're developing, sometimes no feature seems too small or too unrelated. Unfortunately, putting in too much for the sake of completeness can make your code harder for application developers to fully grasp. If your library has several subtle variations of a method with slightly different argument lists, so that I have to constantly check the javadoc and stop to think every time I try to use it, I'm likely to use it less, or just find one way to do it and always do it that way. I will, and have, rejected libraries in the past on the basis of being overly feature-laden. I don't always want or need complexity, and I'd frequently rather work around a small missing element than spend my life searching for exactly the method I want to call.

Pitfall 4: Making your library difficult to read and debug through
When you coat everything in layers upon layers of indirection, reflection, deeply nested interface hierarchies, and painful call graphs, it's hard for your users to figure out what the hell is actually going to happen, and painful to debug through the code when something goes wrong. I can't possibly be the only developer that learns libraries half by reading the documentation, and half by just calling the method that seems right and reading through the code when it doesn't work. This is largely why I absolutely despise Fluent-style development. When it is done perfectly and just works (as in perhaps the case of something like Mockito), it's verbose but acceptable. When it's in a place where there are lots of links in the chain where something could go wrong, it is an absolute nightmare to read and debug. I'm already keeping the call stack of my own application in my head; please make your library as easy as possible for me to add to that mental picture.

The best way to get over most of these pitfalls is to have at least one person on your framework team that actually uses the framework you're developing for something else. Barring that, listen to your users carefully. When they are confused about what to call, frustrated by the difficulty of debugging, or complaining about the difficulty of testing your framework, these aren't problems to treat lightly. Remember, your framework succeeds or fails not on its own internal merits, but on how many people actually use it to develop other code. Application developers are a framework developer's best friend.

(E)Git Pain, Git Joy

Pain
I've been using git in anger for about a week now, after we migrated our repos at work to github. I thought that after half a day of struggle, accidental bad merges, and confusion I finally managed to get the hang of EGit. Lots of right-clicks (hey, it's Eclipse so whatever), remember to commit, then push, great.

Today, a coworker and I do a little pairing, write the skeleton of some new features. I commit and push my changes and leave to go to a meeting that ends up going for 2 hours, only to return to a request that I add a new file that had gotten missed in the commit. Why, why, why do I need to ADD this file? I understand that commandline git (and svn) require "add" before commit, but using Subversive for the last few years I've gotten out of the habit because, seriously, if I highlight it to be checked in just add it for me.

I also seem to keep hitting problems with merging when I've made changes to files that have subsequently been changed. No matter how unrelated and auto-mergeable the changes are, EGit doesn't seem to let me pull them.

These may (probably) be user error but it is so tiresome to be yet again at a place where the tools have not caught up with the cool new thing and don't bother to streamline the common case.

Pleasure
On the other hand, we had a production issue tonight. Right now, the production release process for this particular code is "check out trunk, restart" (yeah, yeah, I'm working on it, it's only been a month ok?). At some point I realize that this change is bad enough that I just need to go back in time to where the code was more stable, and deploy from there. Despite a lack of tags or branches involved in the current process, this was quite easy to do.
git branch rollback_0106
git checkout rollback_0106
git reset --hard <checkin>

Then push the branch, and release from that branch. It's incredibly fast, and very easy. I also was easily able to create a branch with some failing tests for my coworker to look at, without much hassle or annoyance.

So the jury is still out for me, but really I suspect the problem is just that the tooling has not caught up to the technology. Sadly, it seems you only get a little time in the sweet spot of good tooling and good technology before being forced to move to the next hot thing. I guess this is why most people just stick with the command line...

2011: My Year of Open Source

2011 saw a lot of big events in my life. I got hip surgery early in the year. I found myself thinking of leaving my job of six plus years in the summer. I actually left that job in the fall, and took the big leap into startup land in November. But when I think of 2011, I think the biggest changing influence for me was my entry into the world of open source.

I would call my evolution as a developer a three phase project. First, getting all the fundamentals of computing beaten into me in undergrad and graduate school. Second, learning how to be productive in the working world, the gritty details of actually producing production code and solving problems that sometimes are purely technical but often are a matter of orchestration, attention to detail, and engineering. Finally, combining these two aspects, and putting these talents to use in something that touches developers all over the world.

I fell into the ZooKeeper community by happy chance. I had been given a project to implement a company-wide dynamic discovery service. The developers that had come before me had found ZooKeeper, but had the luxury of implementing a solution that didn't have to scale to the volume and geographic diversity of the whole firm. I had requirements for global scaling and entitlements that didn't seem to be common in the ZooKeeper community at that point, and so I was forced to do more than just comb over the documentation to design my system. I cracked open the source code and got to work learning how it really worked.

My first bug was a simple fix to the way failed authentication was communicated to the Java client library. I had to get approvals almost up to the CTO level to be able to participate, but it was worth the effort. Quickly, I started feeling more responsibility to the community. I was, after all, relying on this piece of software to give me a globally available 24/7 system, and I wanted to be able to support clients where downtime could mean trading losses in the millions. I owed it to my own infrastructure to help fix bugs, and really, it was fun. I love writing distributed systems, and the ZooKeeper code base is a pleasure to work with; a little baroque, but just enough to be a fun challenge, and most of the complexity is in the fundamental problem. 

Working in the community has not just been about fixing hard bugs. It's also been about those engineering and teamwork considerations that can be tricky even on a co-located team, and working on a team that I communicated with entirely through email and Jira was intense. Lucky for me, the ZooKeeper community is made up of some of the most mature engineers I've ever had the pleasure of working with. We pull together to solve hard bugs, we all participate in answering questions, and we try to make everyone feel welcome to participate. I consider the community to be the textbook example of open source done right.

Working with this community, working in public after being sequestered in the tightly controlled environment of finance for so many years, flipped something in my brain. I realized that I wanted to be able to live out loud, as it were. I value openness, the ability to work with people all over the world, the ability to work in public, getting feedback and appreciation from the wider community of developers. It also gave me confidence that I could be productive outside of the comfort zone of the place I had worked for years, and that I could show a degree of leadership even without an official title.

In the end, this experience freed me from feeling tied to the corporate life I had been living. I feel open to choose the path I want as a developer. The startup world of today has embraced this open source mentality, which I think is one of the most exciting developments of the last five years. So, I chose to go to a new job that I knew would let me live out loud. 

If you're not already in the open source community, why not crack open your favorite open source project and make 2012 your year of open source? 

A quick one: Testing log messages

I'm writing a talk on unit testing for work, and it reminded me of one of the coolest things I learned from the ZK code base with respect to testing: testing log messages.

You probably don't want to rely too heavily on log messages for testing, but sometimes it's the only indication you have that a certain condition happened. So how do you test it?



    import java.io.ByteArrayOutputStream;
    import java.io.LineNumberReader;
    import java.io.StringReader;
    import java.util.regex.Pattern;

    import org.apache.log4j.Layout;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.log4j.WriterAppender;
    import org.junit.Assert;

    // Redirect logging for the org.apache.zookeeper package into an in-memory
    // buffer, reusing the layout of the console appender.
    Layout layout = Logger.getRootLogger().getAppender("CONSOLE").getLayout();
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    WriterAppender appender = new WriterAppender(layout, os);
    appender.setImmediateFlush(true);
    appender.setThreshold(Level.INFO);
    Logger zlogger = Logger.getLogger("org.apache.zookeeper");
    zlogger.addAppender(appender);

    try {
        // ... exercise the code under test ...
    } finally {
        zlogger.removeAppender(appender);
    }

    // Scan the captured output for the expected log line.
    os.close();
    LineNumberReader r = new LineNumberReader(new StringReader(os.toString()));
    String line;
    Pattern p = Pattern.compile(".*Majority server found.*");
    boolean found = false;
    while ((line = r.readLine()) != null) {
        if (p.matcher(line).matches()) {
            found = true;
            break;
        }
    }
    Assert.assertTrue(
            "Majority server wasn't found while connected to r/o server", found);

From ReadOnlyModeTest. Kudos to Patrick Hunt (@phunt), the original author.


12/29 Edit:
Some folks have found this cringe-worthy. I agree. This is not a testing method that should be common in any code base, and for goodness' sake, if you can find a different way to ensure something happened, do so. But there are a few times when this kind of test splits the difference between expedience and coverage (i.e., I'd rather write a test to validate a log statement than just make the change and observe the log, or refactor the code base to expose the fact that the conditions causing the log were met just to be able to test it).

Effective performance testing

One of the major challenges I have faced in my career is keeping performance up in a large, rapidly-evolving system with many developers. As a Java developer writing enterprise and web applications, I generally have the luxury of not worrying about the minutiae of working within the L1 cache, or having to drop down to assembly to optimize my performance. But even enterprise applications have their performance limits, and users don't want to wait 10+ seconds to have results come back to them, or watch their website freeze due to stop-the-world garbage collection. In the land of e-commerce web apps, performance is even more essential. The bounce rate may double when a result takes 4 seconds vs 1 second, which means dollars we will never realize. So I take performance seriously.
We've all heard the old chestnut "premature optimization is the root of all evil". And indeed, when you find yourself spending half a day worrying about the O(n) time for an operation where n is never bigger than 10, you're probably falling victim to the siren call of premature optimization. A more subtle way I find people fall into the premature optimization trap is by getting sucked into profiling tools. I love a good profile as much as the next programmer, but profiling is devilish work that should only be undertaken when you know you have a problem to solve. But how do you know you have a problem to solve before your problem is affecting your users and costing you business?
The easiest thing to do is simply to capture the runtime of important chunks of work. If you happen to be using a service model of system design, capturing the runtime of each of your exposed endpoints is a good start. For those that happen to be using the Play framework (1.X), this is easily accomplished with a @Before in the controller that records the start time in a thread local, and an @After that logs the total time for the request. Now you can at least see what's running and how long it is taking.
Unfortunately, that alone does very little for detecting performance problems before they hit your live traffic. It also does very little to show you trends in performance without some additional work on your part. The first thing you have to do is actually save the information you're collecting over time. We live in a big data world, but you may not want to save every bit of log output from your services logs forever and ever. And most of the output will be uninteresting except as an aggregate trend. You want to see if your queries are trending up over time, as your data or user load grows, but you probably don't care that at 3:43pm on December 13th the getAvailableSkusForDate endpoint took 100ms to return 20 results. You need something that you can look at quickly, that you can keep around for a long time, and that will warn you in advance if something is going to cause you problems. I'm sure there are many ways to skin this cat, but the way that has worked best for me in the past is basic performance testing.

Starting Simple
Basic performance testing requires a small amount of setup. The main requirement is a data set that is as close as possible to your production data set. A nightly dump of the production database is great. This data source should be dedicated to the performance test for the duration of the run. The database (or file system or whatever) doesn't have to be exactly the same hardware as production, but it should be configured with appropriate indexes etc. so as to most closely match production. For the most basic test, you can simply pick the queries you know to be worrisome, spin up the latest successful version of your trunk code, and time the execution of a set of important endpoint calls against this data. Record the results. You can use a Jenkins build to drive this whole process with fairly trivial setup, and at the laziest just record the results in a log message. Compare this to yesterday. If the difference is an increase beyond a certain threshold, or the total time exceeds some cutoff, fail the build and email the team. This is in fact the only thing I have done so far in my new role, but it is already giving me more confidence, along with a historical record of our most problematic queries over time.
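To make the lazy version concrete, the build step can be as small as something like this (a sketch: the endpoint URLs and the threshold are made up, and the Jenkins job just runs it against the refreshed data set, failing the build on a nonzero exit):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BasicPerfCheck {
        private static final String[] ENDPOINTS = {
            "http://perf-test-host:9000/availability?sku=12345&date=2012-12-13",
            "http://perf-test-host:9000/skus/available?date=2012-12-13"
        };
        private static final long THRESHOLD_MS = 2000;

        public static void main(String[] args) throws Exception {
            boolean tooSlow = false;
            for (String endpoint : ENDPOINTS) {
                long start = System.currentTimeMillis();
                HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
                InputStream in = conn.getInputStream();
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // drain the response so the timing covers the full result
                }
                in.close();
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(endpoint + " took " + elapsed + "ms");
                if (elapsed > THRESHOLD_MS) {
                    tooSlow = true;
                }
            }
            if (tooSlow) {
                System.exit(1); // fail the Jenkins build
            }
        }
    }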

Slightly More
Now that you're tracking the performance over time, a really slow change is not likely to slip into your codebase unnoticed. But really, what you would like to compare is not just today's trunk vs yesterday's trunk, but today's trunk vs today's production. Now things get a little bit trickier. To do this effectively, you may need to refresh the database twice if you are doing any modifications of the data. And if there has been schema evolution between prod and trunk, you need an automated way to apply that evolution to the data before running the trunk changes (this may apply to the basic case, too, but surely you have an automated way to apply schema changes already eh?). There are several nice things about comparing production to trunk. First of all, you can also use this as a regression test by validating that the results match in the two versions of the code. Second, you can feel pretty confident that you directly know that it is the code, and not the database or the hardware or anything else, that is causing performance differences between prod and trunk.
This setup will ultimately be as simple or as complex as your production system. If your production system is a small service talking to a database, setup is trivial. If, on the other hand, your production system is a complex beast of an application that pulls in data from several different sources, warms up large caches, or generally requires a lot of things to be just so to start up, the setup will be correspondingly complex. As with pretty much all good software engineering, the earlier you start running automated performance monitoring, the easier it is likely to be to set up and the more it will likely influence you to spend a little extra time making your system easy to deploy and automate.

Gaga for Performance Monitoring
I have only touched on the tip of the iceberg for performance testing and monitoring. There's a world of tools that sysops and dbas have to track the load of systems and performance of individual database queries. You can use log analysis tools like Splunk to identify hotspots when your system is running under real user load (a weakness of our basic performance testing framework). But I have found that all these complex tools cannot make up for the feeling of security and tracking that a good performance testing suite provides. So give it a shot. If nothing else, it can give you the confidence that all your fun time playing with profiling tools actually has a trackable difference in the performance of your code base.

Valuing time, and teamwork

I have a guilty conscience about my self-perceived deficiencies when it comes to teamwork. I'm just not one of those people that sits at her desk until 10pm every night, or even most nights. I don't like eating dinner at work at all, in fact, and I'm not much for socializing at lunch either. It's not that I don't like my colleagues, or don't want to work hard. But I prefer a more focused day with a break for a meditative workout at lunch, I like to see my boyfriend for dinner at night, and while occasional (or even frequent) after work drinks are fun I enjoy having a social life that does not mostly revolve around work.

Perhaps as homage to my guilty conscience, or perhaps as testament to my impatience with wasted time, I put a lot of energy into doing things right the first time, and setting up systems that make it easier for everyone else to do things right the first time as well. Nothing is more awful to me than spending my time cleaning up foreseeable messes, so instead I've learned to set up automated builds, write and maintain extensive test suites, and monitor them like a hawk to ensure that they stay stable and informative.

What I have come to realize over the years is that when you view time spent as a proxy for dedication, you lose opportunities for productivity gains. We often forget that "laziness" is one of the trio of strengths/vices that makes a great programmer:
The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer. Also hence, this book. See also impatience and hubris. (Larry Wall, Programming Perl, p. 609)
It's seductively easy to fall into the trap of equating hours spent on work with value produced, but it's as silly as using lines of code as a measure of productivity. Valuing time, your own time, your colleagues' time, does not make you less valuable, or less of a team player. If anything, I think many technology teams need more people that know when to take a step back and put things in place so that they can go home at a reasonable hour and live a life outside of work. Which would you rather do: spend 8 hours writing productive code, or spend 6 hours writing productive code and 4 hours fixing bugs that you could've caught with better tests, an automated build, or performance monitoring?

I know what my answer is.

Do the Right Thing

Two weeks into startup life, and things are going great. It's a fascinating change from enterprise development, and so far a very enjoyable one. It has already also provided a good life lesson that illustrates why I am a bit of a zealot about certain structure in the process of software development.

Almost the moment I came in to the company, I was pulled in to help on a project related to the way we compute availability for a given item on a date. This project had been started first by my boss, and had grown into a major effort that touched many parts of the system.

The logic and data design of the system was very thorough, and seemed like a sensible rewrite. A lot of the existing logic lived in a spaghetti monster of PHP code nestled into our front end, and the new system is written as a service using the Play framework. The general development pattern was to point the front end at both the existing logic (to make decisions) but also have it call the new back end service, and eventually swap everything to use the back end. So far, so good. But during development, two major mistakes were made.

First mistake: Not setting up an automated build from the moment the project was first put into source control
I was a little surprised to come in and see that while the system had tests, and developers were running the tests on their changes, there was no automated build set up to run them when people checked in. But, no big deal, I got our sysops guy to build me a machine and we installed Jenkins. Getting the Play build to run in Jenkins was the trivial matter of installing the play plugin and configuring a build to check the code out, clean it and run auto-test. So far, a matter of just a couple hours.
And then the build ran, and the tests failed. I figured they just had some failures that were recently added. But no, the tests ran fine on my local machine. And they ran fine on everyone else's local machines as well. The failures, after some debugging, seemed to boil down to dates. Instead of using Joda Time from the get-go, we had a bunch of logic around java.util.Calendar. The new machine was running on UTC, and despite seeming to set the timezone to New York, we had failures all over the place. So, after too many hours trying to solve the problem with piecemeal moves to Joda Time, I took a day to completely overhaul all the date logic to use Joda.
And still, the build was failing.
Now, I could have let it go at this point, but I had a nagging feeling. If our tests are failing on our UTC build box, what do you think our code is doing on our production UTC machines? Bad things, probably. So I kept digging away, and finally discovered that I needed to set the timezone as a -D parameter to the Play framework on startup. And after a long day and a half of struggling, we had a working build, and a much better understanding of how to properly use dates in our system. But this wouldn't have cost a day and a half of developer time if it had been set up from day 0 of the project.
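For anyone hitting the same wall: java.util and Joda Time each keep their own default zone, and both ultimately fall back on the JVM's user.timezone property, which is most likely why a -D parameter at startup was what finally fixed it. A quick sanity check along these lines (just a sketch) makes a mismatch obvious on the build box:

    import java.util.TimeZone;

    import org.joda.time.DateTimeZone;

    public class TimeZoneSanityCheck {
        public static void main(String[] args) {
            // Run this on the build box: if either default comes back as UTC,
            // date logic that assumes New York time is going to misbehave.
            System.out.println("user.timezone = " + System.getProperty("user.timezone"));
            System.out.println("JVM default   = " + TimeZone.getDefault().getID());
            System.out.println("Joda default  = " + DateTimeZone.getDefault().getID());
            // Passing -Duser.timezone=America/New_York at startup pins both
            // defaults before any date code runs.
        }
    }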

Second mistake: Not thoroughly testing the framework interaction
The month of December is far and away the busiest month for the site. So it should come as no surprise that we would hit our new system with a lot of traffic during this time. At some point mid last week, our system suddenly slowed to a crawl and the database started spiking its load. Frantic hours were spent turning off logic before things finally calmed down. We assumed it was due to poor sql optimization, and so we optimized the queries, but still things were going slow. Finally, we discovered that a particularly heavy-weight call was being made with user id 0, and long story short, bypassing the logic for that userid made everything hum again.
But why were we getting any calls at all with that userid? Must just be a bug in the spaghetti code of the front end system. Turns out that was true, but not in the way we expected.
The Play framework has a relatively nice way of developing. You write controller classes which expose endpoints. The parameters to these methods can be annotated with this nice @Required annotation. Now, we assumed that @Required meant that any call that didn't have that parameter would fail. But we never bothered to write a test for this fact. So, fast forward to Friday. I'm debugging a warning message that we seem to be getting far too often, when I realize that we have a bunch of calls coming in, and being executed, without some parameters. But they weren't marked as @Required. So I told the front-end devs that they needed to pass those parameters, and went to mark them as @Required. And as per my zealotry, I started writing a test that would actually POST to the framework without the parameters, expecting it to be a quick matter of verifying the failure.
As I'm sure you can guess by now, the test didn't fail. Why not? Well, two reasons. One, for @Required to do anything, you have to explicitly write a check to see whether the method parameter validation failed. Nice. Second, those parameters were being declared as primitive types. But to do the validation, we turned them into Objects, and as a result turned a missing parameter into a Long with value 0. Whoops. So up until now, we hadn't been causing any sort of error when parameters were missing, and we'd been populating our database with various 0 values unintentionally.
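To illustrate the check we were missing, here's a sketch (the action and parameter names are made up, not our actual controller):

    import play.data.validation.Required;
    import play.mvc.Controller;

    public class AvailabilityController extends Controller {
        // @Required only records a validation error; nothing fails unless the
        // action explicitly checks for it. Without the check below, a request
        // missing userId sails through and the parameter quietly binds to 0.
        public static void availability(@Required Long userId, @Required String date) {
            if (validation.hasErrors()) {
                error(400, "missing required parameters");
                return;
            }
            // ... the real work ...
            renderText("ok");
        }
    }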

We've fixed it all now. But doing the right thing up front would have cost very little time, and saved us probably 10s of developer hours of work, not to mention the quite likely business cost incurred by site instability during our busiest month. Don't skimp on your testing, even in crunch mode, even for the boring parts of the system. It's just not worth the cost.

Interviewing for Judgement

One of the things that occurs to me after my first week of startup work is how essential it is to hire people with good judgement. In an environment where even the interns have huge amounts of responsibility and everyone is under a lot of time pressure, the judgement to not only know how to make good technical choices but also to know when to ask for help is essential. Is the right fix to this bug to hack around it, or to take a moment to write a bit more scaffolding and fix it everywhere for good? Which schema changes require another set of eyes, and which are trivial? As a manager you're also a contributor, and you probably don't have the time to micromanage decision making, nor would you want to. But when you discover in code review that a feature you thought would be simple required several schema changes and some sketchy queries, you regret not having insight sooner in the process.

Hiring for judgement is hard. It's easy to hire for technical chops. We know how to screen out people that can't write code, and if you follow the general best practices of technical hiring (make them write code, at least pseudocode, a few times), you're unlikely to hire candidates that can't produce anything. But the code-heavy interview process does little to ferret out developers with judgement problems, the ones that can't prioritize work and don't know when to ask for help. If you're hiring for a company the size of Google it doesn't matter that much; you can always put a manager or tech lead in between that person and a deadline. Small companies don't have this luxury, and it is that much more important that the screening process captures more than just sheer technical ability.

I was hoping as I wrote this that I would come up with some ideas for hiring for judgement. But I have never had to hire for judgement until now. Judgement was something I had the luxury of teaching, or simply guessing at and hoping for the best. The internet suggests questions like
Tell me about a time when you encountered a situation where you had to decide whether it was appropriate for you to handle a problem yourself, or escalate it to your manager. What were the circumstances? How did you handle it? Would you handle the same situation the same way again, or would you do it differently this time?
This seems to me better than nothing, but not amazing.

So, over the next few months I'm going to experiment with questions like these and see how they play out. There's no way to do a rigorous study but at least I can start to get a feel for which questions can even tease out interesting answers. And if you have any questions you like, or thoughts on the matter, leave me a comment.

A new computer

I'm writing this from my brand-new Acer netbook. It's a cute little machine: very light, with long battery life, and so far it seems to run Eclipse well enough for me to do some simple ZooKeeper work and other light Java development. It's also, hopefully, a symbol of a new chapter in my life.

A few months ago, I was reading The Creative Habit. I've been in search of my own creativity for a while now. You could say my career as a programmer meets the mythical 10,000-hour rule; after putting in several years of intense schooling followed by several years of focused work as a software developer, I finally started to consider myself an expert at writing general-purpose code. It's great to feel confident that you can code almost anything pretty well, but at some point I started wondering when this expertise would turn into truly creative output. I went into computer science for the cliched-but-true reason that it's a skill that can be applied to almost any industry, and I hoped it would let me build a fulfilling and lucrative career wherever I decided to go. And it has, except: where is my cool side project? Beyond the creativity needed to architect solutions for work, I haven't found my groove.

So I started reading, and exploring, and trying to break out of my work-focused rut. In The Creative Habit, Twyla Tharp recommends keeping the creative tool of your trade with you at all times. For a writer that might be a notebook and pencil; for a musician, a tape recorder. But what is it for a creative developer? A poll of friends brought us to the conclusion that it might just be a small laptop and Python. A friend put it eloquently:
I'm picking "python" because it seems that the writer's pencil or the artist's sketchpad are more for making rough sketches than finished products, and python is one of my preferred languages for quickly hacking up prototypes.
Now here I am, I've finally taken the plunge, bought the little laptop, even started the blog to chronicle the process. 3, 2, 1, GO!

NoSQL and the Enterprise Developer

One of the people I follow on twitter, @strlen, posted a pretty good comment on Hacker News the other day. In it, he calls for NoSQL stores to become better than they currently are (a notion I doubt anyone would disagree with), and mentions some of the things he would like to see evolve in the NoSQL landscape:

* Support for new and interesting distribution models. Allowing users to choose between eventual consistency, quorum protocols, primary copy replication and even transactional replication.
* Support for large, unstructured blob data[...]
* Most NoSQL systems support transactions within the scope of a single value (or document) via the use of quorums, serializing through a single master, etc... However, it'd be nice if something like MegaStore's Entity Groups (or Tablet Groups in Microsoft Azure Cloud SQL server) were supported. 
* Secondary indices, whether internal or external (by shipping a changelog) to the system. 
* True multi-datacenter support (local quorums if desired, async replication to the remote site) including across unreliable, high latency WAN links (disclosure: Voldemort supports this -- https://github.com/voldemort/voldemort/wiki/Multi-datacenter... )


These are all great points. In particular for the enterprise space (and especially the financial space), I think the first and last points are extremely interesting. 


A major concern for the financials is business continuity. If a data center goes down, you had better be able to keep the critical parts of your business running. This has traditionally been done in a few different ways. One major approach is SRDF (EMC's Symmetrix Remote Data Facility) disk, a rather slow and expensive facility that automatically mirrors data from one disk to a backup disk at a different site. For it to be performant at all, the two sites are generally kept pretty close together, with a fat link connecting them. But the overhead of the synchronous write and the cost of the disk are still meaningful, and the ultimate reality of SRDF failover for a database or file system is that system administrators and DBAs frequently need to get involved, and failover can be quite slow. It satisfies certain regulatory requirements, and it satisfies the basic needs of business continuity, but rarely in a clean and easy-to-use fashion.

Now, many NoSQL systems can do some level of data replication across data centers. I personally chose Cassandra for a project because I could pick a write consistency level that guaranteed each write reached a quorum of replicas counted across all data centers, assuring no data loss even if an entire data center failed. And hand-in-hand with point number one, this configurable read/write consistency meant the system would always be available for reads even if a region was network partitioned from the other global regions, while still guaranteeing that a quorum of servers had seen every write before it was acknowledged.
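
To make that concrete, here's a minimal sketch of per-request consistency levels using the DataStax Java driver; the keyspace, table, and column names are invented for illustration, and this isn't necessarily the client you'd have used back then:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class QuorumExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("trades_ks");

            // Writes must reach a quorum of replicas counted across all data
            // centers before being acknowledged, so losing any single data
            // center loses no acknowledged data.
            Statement write = new SimpleStatement(
                    "INSERT INTO trades (id, amount) VALUES (?, ?)", "t-1", 100L)
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            // Reads only need a quorum in the local data center, so the region
            // stays readable even when it is partitioned from the rest of the world.
            Statement read = new SimpleStatement(
                    "SELECT amount FROM trades WHERE id = ?", "t-1")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(read);

            cluster.close();
        }
    }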

Here's a tricky point that quorum-based system designers should know: many enterprises don't have data centers set up to support quorum-based systems within a local region. Often you will see two data centers per global region, meaning that if you need to run a quorum-based system and withstand the loss of any one data center (a general requirement for high-availability business continuity), your data has to cross the WAN at some point. To a distributed systems programmer, this is agony. If only I had three data centers available in-region, the possibilities for quorum-based systems that keep my data safe while still offering relatively fast writes would open right up! But don't count on that being available to your clients.
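
To make the two-data-center problem concrete, here's a hypothetical layout for a five-node quorum ensemble, written as a ZooKeeper zoo.cfg fragment (hostnames invented). Quorum is three, and DC A holds three of the five nodes, so losing DC A halts the whole service even though DC B is perfectly healthy. With a third site you could split the nodes 2/2/1 and survive the loss of any single data center without ever needing a majority to span the WAN.

    # DC A holds the majority: lose this site and no quorum can form
    server.1=zk1.dc-a.example.com:2888:3888
    server.2=zk2.dc-a.example.com:2888:3888
    server.3=zk3.dc-a.example.com:2888:3888
    # DC B can never form a majority of five on its own
    server.4=zk1.dc-b.example.com:2888:3888
    server.5=zk2.dc-b.example.com:2888:3888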

A few glimmers of hope are on the horizon. Companies are aware of the cloud, and some are investigating whether they can use external cloud providers to host some of their computing. If that pans out, a cloud data center could become the third site in a quorum-based system. Regulators are also taking a closer look at data center locality, and wondering whether there isn't too much concentration risk in having two data centers so close together within a geographic region. This may prompt the build-out of additional data centers farther apart within the States, but with better network connections than a cross-Atlantic link.

NoSQL folks looking for the enterprise and financial services markets, take heed. There's desire out there for what you are selling, but if it isn't easy to meet business continuity and regulatory requirements, you will never gain more than a niche position at these firms.

There's one other todo in the NoSQL space around authentication, but I will take the advice of my post reviewers and save that for a later rant.

ZooKeeper 3.4: Lessons Learned

After several months on the planning block, it looks like ZooKeeper 3.4 is finally almost ready to be released. (Edit: Hooray! As of 11/22, release 3.4 is available!) I can say with confidence that all of the committers for the project have learned a lot from the course of this release. And most of it is in the form of "ouch, lessons learned".

First lesson: Solidify your new feature set early.
Going through the Jira, the earliest new feature for the 3.4 release is the uplift of the ZAB protocol to ZAB 1.0. No small feature, to be sure; we were still debugging minor issues with it through the very end stages of our 3.4 work. We also added multi transactions, Kerberos support, a read-only ZooKeeper mode, Netty support, Windows support for the C client, and certainly others I'm forgetting. Some of these features were pretty simple uplifts, but some of them caused build instability for months and a great deal of distraction. Many of them were added as "just one more feature", while many other features were neglected because "we're almost ready for 3.4" (as it turned out, often not actually the case). If we had decided early which major new features we were pushing for in 3.4, we could have concentrated our efforts more effectively and delivered much sooner.
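
As a taste of one of those features, here's a minimal sketch of the multi-transaction API that ships with 3.4; the paths and payloads are made up, and connection handling and error handling are omitted for brevity:

    import java.util.Arrays;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class MultiExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);

            // All three operations commit atomically, or none of them do.
            zk.multi(Arrays.asList(
                    Op.create("/app/config", "v2".getBytes(),
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                    Op.setData("/app/version", "2".getBytes(), -1),
                    Op.delete("/app/config-old", -1)));

            zk.close();
        }
    }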

Second lesson: When it's time to push, push.
Giving birth requires a period of concentrated pushing. If you think you can push a little now, then put it off for a few days, then a bit more, then take a few weeks off... the baby will never come, and neither will the release. It took several attempts before the community finally rallied behind the effort to get a release out, and we lost a lot of momentum in the process. We didn't have a solid, pre-agreed-upon feature list to tell us when we were done, so things just kept getting in the way. Whenever attention drifted from the release, a minor bug or feature request would come in, and it just seemed so small; what was the harm?

Third lesson: Prioritize as a community, and stick to those priorities.
This falls in with setting the feature list early, but it goes beyond that. Our community was split between those who were very interested in seeing 3.4 released and those who were working on major new changes or refactorings against trunk. As a result, we all ended up feeling shortchanged. Contributors with new features did not get the attention their features needed, and many still sit in unreviewed patch form. Users who were hungry for the 3.4 release were frustrated with our lack of attention to getting it out. Meanwhile, some massive refactoring efforts continued on trunk during the release process, leaving a frustrated committer base stuck backporting and forward-porting patches between increasingly divergent branches. Those efforts found bugs, but not without cost. Having unclear priorities divided the community, caused some tension, and ultimately slowed the whole release process down.

Fourth lesson: You can always do more releases; it doesn't all have to happen now.
This is perhaps my own biggest takeaway from the process. I wish we had done much less, done it much faster, and been willing to release a 3.4 that was quickly followed by 3.4.1, 3.5, and so on, as needed. Proponents of agile development and release practices have a good point: the more often you release, the less there is to go wrong and the easier it is to fix when something does. Otherwise it becomes a self-perpetuating cycle: we don't release frequently, so people want to cram in as many new features as possible, which slows down the releases, which invites pushes for even more new features, which brings more bugs and further-delayed releases, and on and on.

These lessons may seem obvious in retrospect, but they came at the price of many people's time and effort. I'm proud of our community for pulling together in the end, but I also hope that 3.5 will be a different and less arduous journey.