For the first time in 3 years, Skype was down today - and as I write this, it is still slowly coming back online. A ton of articles were written today, almost all pointing back to Skype's blog post or status update, the most important part of which said this (I've shortened it a bit):
Some of these computers are what we call ‘supernodes’ – they act a bit like phone directories for Skype. If you want to talk to someone, and your Skype app can’t find them immediately ... your computer or phone will first try to find a supernode to figure out how to reach them.
Under normal circumstances, there are a large number of supernodes available. Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype. As Skype relies on being able to maintain contact with supernodes, it may appear offline for some of you.
Let's explain this a bit more.
Explaining Supernodes
If you go back and read my primer on the technology behind Skype and P2P networks, I described supernodes as Skype clients that are on the public Internet, NOT behind a firewall or NAT device, and that broker the communication between two Skype clients. In a very simplistic view, the picture looks like this:
As I note in the update section to that post, the Skype clients acting as "supernodes":
perform the somewhat limited functions of connecting nodes together, providing a distributed database and choosing appropriate nodes to act as "relay nodes" when necessary.
The supernodes are what connect individual Skype clients to each other and create the P2P "overlay network"... the "cloud"... that connects all Skype clients to each other.
These "supernodes" run the regular Skype software. The ONLY difference is that they are on the public Internet. So if you are running Skype on a computer - and you are NOT behind a firewall, there is a chance that your computer could become a supernode. That's just how Skype works. So there are a lot of these supernodes out on the public Internet:
Here's the thing... EVERY Skype client is connected out to a supernode. You have to be, in order to be connected to the larger directory of Skype users and for them to know how to reach you. (Note that Skype clients behind the same firewall may not be connected to the same supernode.) So it may look like this:
The supernodes are then connected to each other... creating Skype's globally distributed directory database, which in a simplified form you could think of like this:
(Skype's supernode connection algorithm is presumably more complex than the simple mesh I'm showing here... but the point is that they are connected to each other.)
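To make the "distributed directory" idea a bit more concrete, here is a minimal Python sketch. It is purely illustrative: the Supernode class, the link helper, the addresses and the flood-style lookup are all my own assumptions, not Skype's actual protocol, which has not been published in this detail.

```python
class Supernode:
    """Toy supernode: holds directory entries for its own clients (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.peers = []       # other supernodes this one is linked to
        self.directory = {}   # username -> address of locally attached clients

    def register(self, username, address):
        # A client attaching to this supernode announces how to reach it.
        self.directory[username] = address

    def lookup(self, username, visited=None):
        # Check the local entries first, then ask linked supernodes.
        # The visited set keeps the search from looping around the mesh.
        visited = visited if visited is not None else set()
        visited.add(self.name)
        if username in self.directory:
            return self.directory[username]
        for peer in self.peers:
            if peer.name not in visited:
                found = peer.lookup(username, visited)
                if found is not None:
                    return found
        return None


def link(a, b):
    """Connect two supernodes to each other."""
    a.peers.append(b)
    b.peers.append(a)


sn_a, sn_b, sn_c = Supernode("A"), Supernode("B"), Supernode("C")
link(sn_a, sn_b)
link(sn_b, sn_c)
sn_a.register("alice", "203.0.113.10:5000")   # made-up example addresses
sn_c.register("carol", "198.51.100.7:6000")

carol_address = sn_a.lookup("carol")   # found by way of B, then C
nobody = sn_a.lookup("mallory")        # not registered anywhere
```

Alice's supernode knows nothing about Carol directly, but because the supernodes are linked, the lookup still finds her. A real system would presumably use something smarter than asking every neighbor in turn.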
Now, Skype's picture is not exactly like this. We know from the explanations of the 2007 outage that Skype uses a hybrid architecture that involves some "authentication servers" that Skype clients connect to in order to first be granted access to the Skype P2P cloud. I'm not aware of anyone publishing technical details on exactly how those authentication servers connect into the Skype infrastructure, but let's just say it looks something like this:
Skype clients need to connect to these authentication servers in order to validate their username and password, and presumably to validate their calling plan, how much money they have left in their account for calls, etc.
Now, the cool part about the "self-healing" aspect of the supernode architecture is that if a supernode goes down, Skype clients will simply attach to another supernode:
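That reattachment behavior can be sketched in a few lines of Python. Again, this is a toy model under my own assumptions (the Overlay class, the random choice of replacement supernode and the node names are all made up for illustration). It only shows the general idea of clients moving to another supernode when theirs disappears.

```python
import random


class Overlay:
    """Toy model of clients attached to supernodes (not Skype's real protocol)."""

    def __init__(self, supernodes, seed=0):
        self.online = set(supernodes)
        self.attached = {}              # client -> supernode
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def attach(self, client):
        # With no supernodes left, the client simply appears offline.
        if not self.online:
            self.attached.pop(client, None)
            return None
        choice = self.rng.choice(sorted(self.online))
        self.attached[client] = choice
        return choice

    def fail(self, supernode):
        # Take a supernode offline and move its clients elsewhere.
        self.online.discard(supernode)
        orphans = [c for c, s in self.attached.items() if s == supernode]
        for client in orphans:
            self.attach(client)
        return orphans


overlay = Overlay(["sn1", "sn2", "sn3"])
for client in ["alice", "bob", "carol", "dave"]:
    overlay.attach(client)

overlay.fail("sn1")
# Everyone is still connected, just not to sn1.
still_connected = all(s in overlay.online for s in overlay.attached.values())
```

As long as some supernodes remain, every client ends up attached somewhere. Once the last supernode fails, the attach step has nowhere to go and the clients drop out of the overlay entirely.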
The problem with the outage today seems to be, from Skype's explanation, that a great number of supernodes went offline, tearing apart the fabric of Skype's P2P network overlay:
OOPS.
Something broke. We don't know what. Skype's blog post says only:
Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype.
What was the "problem affecting some versions of Skype"? No clue. Was it a software update that somehow affected the supernode algorithm? Did it affect the communication with clients?
No clue.
But according to Skype, that's what happened. Hopefully they will be a bit more forthcoming soon (although perhaps NOT, given their pre-IPO status), but at the moment that's all we have to go on.
My guess would be that there might also have been "cascading failures" in this scenario. If there was, say, a software update affecting some supernodes, as those supernodes dropped offline, the increased load of Skype clients trying to connect to online supernodes might have caused some of them to then drop offline. Or when a supernode came back online, it may have been overwhelmed by the quantity of connection requests and soon failed again. As I said, that's purely a guess... but you could see those kinds of failures happening in a situation like this.
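Here is a small Python sketch of how such a cascade could play out. It is a thought experiment only: the simulate_cascade function, the per-supernode capacities, the even spreading of load and all the numbers are assumptions I made up, not anything Skype has described.

```python
def simulate_cascade(capacities, num_clients, initially_failed):
    """Toy cascade model (hypothetical): clients spread evenly over the
    surviving supernodes, and any supernode whose capacity is below its
    share of the load drops offline. Repeat until the network is stable.

    Returns (indices of surviving supernodes, number of failure rounds).
    """
    online = {i: cap for i, cap in enumerate(capacities)
              if i not in initially_failed}
    rounds = 0
    while online:
        load = num_clients / len(online)   # clients per surviving supernode
        overloaded = [i for i, cap in online.items() if cap < load]
        if not overloaded:
            break                          # everyone can carry their share
        for i in overloaded:
            del online[i]
        rounds += 1
    return sorted(online), rounds


capacities = [100, 120, 150, 200, 400]

# All five supernodes up: 500 clients means 100 each, which every node can handle.
healthy = simulate_cascade(capacities, 500, set())

# Knock out only the biggest supernode: the load per node jumps to 125,
# the two smallest nodes fail, and then the remaining two are overwhelmed.
collapsed = simulate_cascade(capacities, 500, {4})
```

In this made-up example, losing a single large supernode is enough to take down the whole network in two rounds, which is the kind of snowball effect I'm guessing at above.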
Skype's "Solution"
As a solution, Skype's blog post says this:
What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as fast as they can, which should gradually return things to normal. This may take a few hours, and we sincerely apologise for the disruption to your conversations. Some features, like group video calling, may take longer to return to normal.
No details yet on what these "mega-supernodes" are, but some speculation is that instead of relying on individual Skype client computers to "become" supernodes, Skype is going out and setting up computers/servers specifically as supernodes. Rather than rely on potentially unstable computers, Skype goes out and gets some rock solid servers under their own control and sets those up as supernodes.
Maybe that's what a "mega-supernode" is. Maybe it's a higher level supernode... to which "regular" supernodes connect. Again under Skype's control... but providing a tighter core P2P network that houses the overall directory.
We don't know yet... but those are the kinds of things Skype could be doing. Again, hopefully we'll get more details soon... although we'll have to see.
As I write this, my Skype client shows 4.5 million users online... it's the beginning of the day in Europe and I'm sure folks there are trying to get online. Hopefully Skype will be getting their network back online soon.
And hopefully we'll get some better technical explanations, too!