cyberhoboing with dominic tarr

Working Around the CAP Theorem with rumours-db

If you have been following databases at all recently, you are sure to have heard of the CAP theorem. If not it’s explained very simply here: A plain english introduction to the CAP Theorem.

The CAP theorem is about a database design problem when a database is scaled out to multiple nodes. CAP stands for Consistency, Availability, and Partition Tolerance. Like many classic engineering problems, you can pick any two out of the three options. Except, it doesn’t make sense not to pick Partition Tolerance, because then your database wouldn’t be scaled-out. So really, you can pick either Availability, or Consistency, or neither; which would mean your database sucks.

The problem is explained very nicely in A plain english introduction to the CAP Theorem. There is something very solid about the CAP theorem. It seems immovable — it’s a theorem, after all.

But what if there is another approach? What if, instead of trying to build a better database, we built something that isn’t a database, and so simply avoids the assumptions of the CAP theorem by stepping outside it? But what would this mean? What is a database, really?

A database, I think, is like a bookshelf. You can PUT a book on the shelf, and you can GET a book off. If someone has a book currently checked-out (write-lock) then you can’t read from it.

So, a distributed database is like a group of libraries with an inter-loan system. Each library has most of the books that the others have, but sometimes a member at one library needs a book that must be borrowed from another library. In another situation, a new edition of a certain book arrives at one library before another.

Suppose the libraries are all on different islands, and books and book requests are carried between the islands by brave librarian-sailors. The librarians are brave, but not fearless, and sometimes the inter-loan service is disrupted because of poor weather. At certain times of the year, disruptions are more frequent.

When you go to your local library to get a book out, there is always a chance that there is a new edition of that book that the library doesn’t even know about yet. You could check out a book, read it, return it, and then later, a new edition arrives at your library, fixing important errors in the edition that you read.

How do you know that you are reading the latest edition of your book? Especially in inclement weather?

You can’t. That is what the CAP theorem teaches us.

What kind of systems can you build if you don’t ever know if your data is true? Well, not systems you can really trust, at least.

So, what do you do if you need to know that the data you got was actually up to date? The best you can do is keep on checking back at the library to see if a new edition has arrived. Naively polling the database — but this multiplies reads and decreases performance!

A database-like phone service which has the CAP problem is described in A plain english introduction to the CAP Theorem. In certain scenarios, the clients of the service make a request, which is answered with what turns out to be out-of-date information. This explanation gives us a clue: the clients call the service on their telephones, but the service does not take advantage of the fact that telephones provide two-way communication.

If the clients had to keep calling back to check if information had changed, it would be very annoying for both the clients and the service. But if the service had the client’s phone numbers, the service could call back the clients when the information they are interested in changes, instead of hoping that the client will call back in time to make use of the new information.

This is my proposal: A “database” with push notifications!

This no longer really fits into the library/bookshelf metaphor, but as you can see, works well with the telephone metaphor. Although a client may temporarily have out-of-date data, they will be back up to date as soon as possible. Humans are good at dealing with changing information — why not computers?

The idea of a database with push notifications is fairly new. CouchDB has a replication feature, and a realtime changes feed — but it’s not the primary way to interact with the database.

We need to start fresh! Or start from simpler building blocks. Enter LevelDB. LevelDB is a lightweight, performant database created by the author of Big Table. It’s used in implementation of IndexedDB in Google Chrome, and has a very interesting node.js binding, LevelUP. Putting these pieces together I am working on a database where the primary way to interact is via registering for push notifications.

This is RumoursDB; currently under rapid development, but available for eager users!

A talk I gave on Mission Critical Javascript at melb.js

Adventure Boat Building

why you should NEVER write a package manager

If you are writing a package manager today, you are wasting your time.

The problem is already solved. You will probably discover this as you explore the your problem space, maybe it’s not obvious to you yet, but I think it will be.

Instead of writing another package manager, just use npm. If you are writing a package manager for a new non-js language, do whatever it takes to get that new language to load modules the way node does. The right way.

Many languages have a package manager, but most languages load modules from the $PATH. Anyone who has spent some time using this will know how complicated it can get.

The problem with modules, is sooner or later you get a situation where a module A depends on modules B and C, and C depends on B’, a different version of B.

If your system uses a global $PATH type variable to list where to look for modules (as in most languages), then you can’t have more than one version of a given module. Welcome to DLL hell.

here is a post explaining how ruby’s bundler works. If two different versions of the same package are requested, bundler changes the versions of the other packages, until they agree on a single version for each package.

So, if A is only compatible with B, and C is only compatible with B’, then your app is broken.

You now need to change someone else’s module, a module that was already working, so that it can work with another module. Another unrelated module. That is like having a toaster that cannot be used in the same kitchen as a mircowave.

Microwaves and toasters are completely independent objects and have no right to affect each others behaviour.

This is clearly unreasonable, yet, there are many languages and systems that work like this.

There is a lot of hype about node, but there isn’t enough hype about node’s module system.

When you npm install package it gets installed locally, into ./node_modules/package. When you require('package') it looks for package in ./node_modules, then ../node_modules then ../../node_modules and so on, until it’s found or you have run out of parent directories. It’s so simple you can install modules just by copying files about.

In many languages requiring/including a module adds that name to the global scope. In node, require returns a value that you must assign to a variable.

By keeping modules out of the global scope, and looking for them locally, you can have different versions of the same named module if you like.

This is the right way.

If you can make sure your language handles modules in a sensible way (look locally, store locally) then you don’t need a package manager, because you can just use npm.

Then your life will be easy.

Distributed Programming is Easy

At least, it’s as easy as non-distributed (“local”) programming. Okay, so programming isn’t exactly easy even on a good day. But we still manage to do it anyway.

What I’m trying to say is; distributed programming is not “TOO HARD”.

The problem is that we are resisting the very idea of distributed programming. People are still approaching the arena of distributed programming hoping that something will come along and make it disappear, that someone clever will invent an abstraction that wraps distribution and makes it feel local. Paul Graham even listed it in his list of frighteningly ambitious start up ideas (#6).

It’s not gonna happen.

Instead of trying the change the universe so that it fits our mental models better, lets just change our mental models.

We’ve just got to abandon everything we think we know.

I’ve been investigating distributed programming; and I’ve begun to notice a few patterns.

Distributed programming has a different foundation to local computing, and different facts become the assumptions that an application is built around. Firstly, be idempotent.

Network communication is inherently unreliable. And even if it was, software crashes. You need to be able to resend any message, and not have that apply the same as sending the same message only once.

Also see Life beyond Distributed Transactions

Secondly, know the difference between monotonic, and non-monotonic. A logical operation is monotonic if once it becomes true, it becomes true forever. For example: non-virginity. Once you have become not a virgin, you will continue to be so. Non-virginity cannot expire, or be retracted. This is important because things are are monotonic will naturally be eventually consistent. For non-monotonic things (read: delete from a set) you may need to track state carefully, or make sure all nodes know about an update.

If you can design your application to maximise monotonic, commutative, idempotent operations, your life will be easy.

Also see bloom bloom is a prototype language designed for distributed programming, founded on the CALM (Consistency as Logical Monotonicity) conjecture.

A third assumption is that each individual computer is itself unreliable. Disks fail, datacenters catch on fire, and cell phones get dropped into water (remember, a client is just a type of node). A pattern I’ve observed in studying distributed software is that often, distributed means peer to peer. Even if it’s entirely behind a firewall, the architecture of Amazon’s dynamo (or it’s open source clone, Cassandra) is entirely peer to peer, with no master node.

A phrase you might have heard is “single point of failure”. A peer to peer architecture avoids that.

Each of the techniques described in the Amazon dynamo paper is itself quite simple, but used together they produce a coherent whole that in incrementally scalable, without a single point of failure. Interestingly, dynamo is not itself a actually database. It’s a replication layer. The actual storage of data is achieved via a customizable plugin. It may even be mysql!

I have only just begun exploring this area. As more do the same, we’ll figure out new approaches, and build new tools and frameworks that are distributed by default, then distributed programming will be easy.

why your app needs body-language

Travelling always reveals what you take for granted. I’m currently on my first trip to Asia, and one thing that has surprised me, is just how chilled out the driving is. From what I heard; everyone on motor-cycles, and no rules; I expected horns blaring, people shouting, total chaos.

I was astonished to find road users putting out a vibe way more relaxed than what I’m used to.

What I think it is, is that motorcycles have body-language. For the same reason there is no such thing as pedestrian-rage. If someone makes a mistake on the foot path, they can apologise with a gesture. But if someone makes a mistake in a car, you just assume they did it intentionally, because they are an asshole.

Everyone knows that communication via the internet is fraught with difficulty, there being no notation for emotional tone. But instead of just complaining that email/chat/comments suck, maybe we can do something about it? maybe we can give the internet body-language?

Body-language is interesting because it’s mostly unintentional. Body language doesn’t represent abstract ideas like spoken or written language does, instead it’s the by-product of your emotions and thought processes. And it’s better for that, because people mostly can’t lie with body language.

So I’m not talking about a little icon that says “user is angry/happy/joking” but, rather, to let aspects of their non-verbal, non-textual behaviour fall through to give the other users information on what they are thinking.

For example, in chat, you don’t really know how busy someone is. Are you interrupting them? Manually set statuses may help, but they must be done intentionally so they are more workload. But what if you could, say, see the mouse pointers of each user currently on the page? you could then see how much (real)time the are spending on the page, and have some idea about what they are doing. Thus you could estimate their current attention budget, and thus know how welcome your interruption may or may not be.

This is just one idea, but the value in this approach, I think, is that the information it adds is not created or consumed consciously. It adds no extra perceived work load for the user. I think we need better support for real-time in our application stacks, to pull this off, but we’re working on it.

death to client-server: let all nodes be created equal!

Would you use a hammer to butter your toast?

That is what you are doing if you are using a database to create a application to be used by multiple humans. If you want to build an application with lots of problems, make sure you have mismatched geometry. Pick technologies that don’t match the what is really going on.

What is a database? Basically, it’s like a book, like a ledger or inventory. You can write things, strike things out, and write new things, and if it’s not on paper it’s not real. Databases are a pretty good match for things like banks, or keeping track of what stock you have, so if you are building an application like that, use a database.

On the other hand, if you are building something to be used by humans with humans… then a database is just plain wrong.

Humans have a completely different interface. The fundamental assumption in a database is that all communication is request and response. It is not possible for the database to spontaneously inform a client of something. It is not possible for stock to leap of a shelf and spontaneously deliver it self to a customer. Humans, however, are fundamentally peer-to-peer and real-time. There aren’t two types of humans, one that only answers questions, and one that only asks questions. that would be absurd.

If you try to use databases for human applications, an awkwardness will pervade your application (unless your application is limited to the subset of human activities that do work like clients and servers). Commerse is like that, anything collaborative, anything “social”, is definately not.

What we need, is a new kind of data layer. A data layer where there is no strong distinction between servers and clients - there arn’t clients or servers, just nodes. I don’t mean that a specific node can’t have an important job, like being a leader, but any node should have the potential to initiate communication to any other.

What we need is a data layer that is both peer-to-peer and real-time.

This line of thought is not well developed, but there are a few projects out there that are working towards this. CouchDB has the _changes feed, and redis has pubsub, but these features are only tacked on, rather than baked in.

But there are people working on realtime backends. racer, sharejs, firebase, and my own project crdt If you know of more please let me know. These are still client-server, but my project, crdt, goes one step further and is also peer-to-peer!

so start watching these projects, they are all pretty fresh, but this. is. the. future.

Have You Ever Lied in a Job Interview?

I just read @karlseguin’s blog post, Learning is More Important than Knowing

He says people should interview for the ability to learn, rather than a specific bag of tricks, he’s completely right about learning being more important.

But there is another angle when it comes to job interviews.

Because when the interviewer asks ‘do you know X?’ You know what really happens,..

The interviewee lies, and then leans it quickly.

We’ve all done this, lets just admit it.

Your future boss, if they are any good, probably did it as well.

This is the natural order of things.

Sometimes it’s important not to take things too literally. In Soul of a New Machine (highly recommended) Data General use the production schedule like that. They would ask some new guy; “we need you to do X (some challenging assignment), and we need it in 6 weeks. can you do it?”

They knew full-well that doing X in 6 weeks was impossible, they just wanted the guy to “sign up” . They didn’t have the budget to hire experienced engineers, so they hired recent graduates. They had an exciting project, though,half the company had moved off to Southern Carolina to build the new (non-back compatible) 32 bit computer the “Fountainhead” project. The remaining team, who are the subject of the book, where left to build merely the next 16 bit computer in the Eclipse series. But they decided that was too boring, so they secretly also built a 32 bit computer. Except theirs is also backwards compatible.

Did I mention that this was an exciting book?

Anyway, the point is, they knew the schedule was impossible. What they wanted was to get the designer’s mouths to write a cheque that their ass would have to cash.

A job interview is like that, but less awesome. The real question, once you know they are smart, is “How committed is this guy?”

Now I reckon, the liar is much more committed than the one who says “well, I don’t know X, but I can learn it, I’m a fast learner”

Because the liar will look more foolish if he can’t.

Now, this is all just posturing, it’s not really dishonest; because we all know, he’s just gonna learn it… Which brings me to another point, challenge is more important than money.

So just remember, not everything is always as it seems. Especially not at job interviews.

Challenge: Reliable Reputation Graphs?

We’ve had reputation systems for sometime, being able to say whether someone followed through with a transaction effectively makes things like ebay workable.

However, the implementations we currently have are simplistic and intrinsically gameable.

For example, on ebay, you can see who has good feedback, and who has bad. But what is not taken into account is that reputation goes both ways. Reliable people both act and speak honestly. That is, if you are an asshole, you will trade badly, and possibly also give bad feedback to people who don’t deserve it.

We observe this is everyday life that the references of highly respectable people count for a lot more than people of ordinary levels of respect.

That is not how current systems seem to work, so I think there is an opportunity here.

In flat systems like ebay or reddit, every vote counts the same, so it’s possible to create many fake accounts and get them to vouch for a target account.

how to game reddit with fake accounts

ebay works around this problem by making it as hard as possible to create a new account once you have been banned, work around this by resigning up with all new details, this only stops unsophisticated cheats.

Advagato.org, an opensource community, has an interesting trust metric that goes some way to avoiding this problem. starting from a trusted seed, of the founders of the site, trust is propagated outwards using a network flow graph calculation

This works quite well, but it’s still possible for a troll to penetrate this system. According to the wikipedia page several have. Also see the comments on this article

The trouble here, I believe, is about controversy and opinions. Not everyone agrees on who the trolls are. An opinion may be more obnoxious to you but more tolerable to me. Maybe they have some good ideas, but say them in an obnoxious way. One-size-fits-all fits everyone poorly.

This is also a problem on Amazon. There are many books which have a lot of 5 star reviews, but also a lot of one star reviews. So, is that a good book or an awful book?

Naturally, if I believe in evolution, I’m probably not going to highly rate a book on creationism. but other people will…

The question is, what sort of person am I? and who are my opinion-peers? show me the books that they like to read.

Flat systems cannot infer this sort of information merely from whether you have voted from something or not, but I believe, it’s all there in the reputation graph.

I have managed to find one study that addresses this; Controversial users demand local trust metrics which describes a system trust metric applied to data from epinions.com.

The interesting thing in this paper, is that all users are ranked from the point of view of each particular user. by calculating the distances between users on a graph of expressions of agreement. Corners are cut to enable such a calculation to be performed quickly, and I’m not convinced that ranking by the shortest path really maps onto the real-world problem precisely, but I think the idea of a local rank is key.

The way forward, I believe is a system that combines the approaches and solves the problems approached by both systems. We need a reliable system that is ungameable, but is also personalised, because the important thing about trust is who you can trust.

I agree with @jonbischke that reputation graph is one of the web’s biggest opportunities. But also, if it can detect controversy and cheats… then it has the power to revolutionize society as well.