Sunday, October 21, 2007

Scale in an Ontology Vacuum

I had the pleasure today of reading Clay Shirky's Ontology is Overrated. For those of us in the collaboration software business, understanding how people classify artifacts is not just a science; it is vital to the evolution of the software we build.

For our purposes, I think it's a double-edged sword. Make things too wide open, and users won't evolve their own classifications and will stray from the true strengths of the tools. Make things too ontology-driven, and users will naturally migrate away from the structure, save everything locally, and keep emailing and searching with Google Desktop. We have to build a better mousetrap to survive.

For me, a big take-away from Clay's piece was that any attempt to define classifications will fail. The Dewey 200 category is a fantastic example and one that really drives home our inability as simple humans to step out of context.

So, to paraphrase Clay's hypothesis a bit, any attempt I make to define an ontology will fail, mostly because there is no way I can out-think the masses. OK, fair enough, so how do we build software that better helps the masses? It's not too difficult at face value; things like tag clouds work great. Unfortunately, the implementation problems multiply at scale. As the volume of information available electronically continues to grow, and as information compounds on top of itself through links, blogs, etc., aggregating tag clouds and, more importantly, doing things like suggesting similar clouds shifts from a usability problem to an architectural one.
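
To make that concrete, here is a minimal sketch of aggregating tag clouds and suggesting similar ones. Everything in it is my own illustration rather than anything from Clay's piece or a real product: the function names, the sample clouds, and the choice of cosine similarity as the measure of "similar" are all assumptions.

```python
from collections import Counter
from math import sqrt


def aggregate(clouds):
    """Merge several tag clouds (tag -> count) into one combined cloud."""
    total = Counter()
    for cloud in clouds:
        total.update(cloud)  # adds counts tag by tag
    return total


def cosine_similarity(a, b):
    """Cosine similarity between two tag clouds; 0.0 if either is empty."""
    dot = sum(count * b[tag] for tag, count in a.items() if tag in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_similar(target, candidates, top_n=3):
    """Rank candidate clouds (name -> cloud) by similarity to the target."""
    scored = [(name, cosine_similarity(target, cloud))
              for name, cloud in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical sample data: three small tag clouds.
wiki = Counter({"ontology": 3, "tagging": 5, "search": 1})
blog = Counter({"tagging": 2, "scale": 4})
forum = Counter({"recipes": 6})

combined = aggregate([wiki, blog])
ranking = suggest_similar(wiki, {"blog": blog, "forum": forum})
```

With three clouds this is trivial. The catch is that suggest_similar compares the target against every candidate, so suggesting neighbours for every cloud means comparing every pair, and that is exactly the kind of work that stops fitting on one box.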

For collaboration software to innovate, it has to face several interesting dilemmas. First, how do we expand beyond the bounds of our own control and pull in relevant data from outside our domain? It's scary, but if you believe that collaboration software must go down that path, then we're faced with more integration concerns and less of the simplified workflow and text processing that most people associate with their collaboration tools today.

Assuming you can go that far, how do you actually make sense of all that data? You start to run into scale problems real fast. Customers want one node to run everything and, at the same time, roll up data from five internal sources and ten external sources with partial federated identity, all while being responsive. That's a tall order, and one that makes the "C" in "CAP" increasingly difficult to achieve.

My gut tells me that Werner Vogels' and Amazon's approach to eventual consistency described in Amazon's Dynamo will win out. When you're aggregating and analysing terabytes of data looking for relevance, you have to make sacrifices.
