I was recently at a client site doing an NSX workshop/demo when something happened in my lab that I did not foresee: my only NSX controller was in a disconnected state and would not reconnect. Instead of having a full-blown panic attack, I decided to leave the controller as it was and continue with the demo… What happened next was something I should have predicted.
For those readers who are not familiar with the NSX architecture, let me recap for you. At the top we have the NSX Manager, which provides the API and UI to users; below that we have a number of controllers that talk to the ESXi hosts using agents. Now, in a normal deployment you will always have 3 or 5 controllers, not ONE! Even in a lab environment where resources are limited you should have three and never ONE, but heck, I had one.
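Why 3 or 5 and never ONE? The controllers form a majority-based cluster, so a cluster of N members can only keep working while more than half of them are up. A minimal sketch of that arithmetic (the function name is mine, not an NSX API):

```python
def tolerated_failures(controllers: int) -> int:
    """Number of controller failures a majority-based cluster can
    survive while still holding a quorum (more than half alive)."""
    return (controllers - 1) // 2

# A single controller tolerates zero failures -- the "Problem of ONE".
for n in (1, 3, 5):
    print(n, "controllers ->", tolerated_failures(n), "failure(s) tolerated")
```

With ONE controller the answer is zero: any single failure takes the control plane down, which is exactly what I demonstrated by accident.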
So my ONE and only controller was stuck in a disconnected state, everything I attempted to get it back online failed, and it was demo time. I fired up my demo and showed the customer my three-tier application that was load balanced using NSX, the firewall rules, the UI and much more goodness. The customer then asked me to change something, and I had to tell him that this was not possible.
The controllers in an NSX environment are used to push updates to the ESXi hosts when needed (they do more, but that's not the point), and because my controller had died I was not able to update my environment. I could use the environment fine, just no updates. The customer then declared that NSX was unreliable and could never meet the harsh requirements that a physical device could. "See," he said, "you just proved my point: ONE software bug and the whole thing is broken."
Short on time, I parked the conversation, but it got me thinking…
Now, in my role as pre-sales I have been at many client sites and seen many networks, both large and small, and to date I have only met ONE client that has actually spent any time thinking about the "Problem of ONE". What is the Problem of ONE, you might ask? Well, it's pretty simple really: if you set up a core router, for example, having ONE creates a problem. If that ONE router dies then you have ZERO, which is less than ONE and generally not something you want to have. In the networking world, network admins negate the "Problem of ONE" by adding a second router. Problem SOLVED!!!
Or is it solved? I will bet anything that 99.99% of network admins add a second router of the same brand and model, running the same software version as the first router, and then use some protocol to make it either active/active or active/passive. That ONE customer I mentioned before figured out that if a software bug were to hit the first router, there would be a real chance it would hit the second router too. The solution for them was to have two routers in an active/active configuration while making sure that the two routers came from different vendors.
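Mixing vendors is only practical with an open, standardised first-hop redundancy protocol such as VRRP (RFC 5798), rather than a proprietary one tied to a single vendor. A hypothetical config sketch, in IOS-style syntax purely for illustration (the second vendor's CLI would look different, but the protocol on the wire interoperates):

```
! Router A (vendor 1) -- higher priority, so it is master for group 10
interface GigabitEthernet0/0
 ip address 10.0.0.2 255.255.255.0
 vrrp 10 ip 10.0.0.1
 vrrp 10 priority 110

! Router B (vendor 2) -- default priority 100, takes over if A fails
interface GigabitEthernet0/0
 ip address 10.0.0.3 255.255.255.0
 vrrp 10 ip 10.0.0.1
```

A single VRRP group is active/passive; running a second group with the priorities swapped (and splitting hosts between the two virtual IPs) is one common way to get the active/active behaviour that customer was after.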
So can the "Problem of ONE" be resolved for SDN? Right now I do not think so. There are some workarounds that might provide possible solutions, but they are very complex to set up… But I ask you: if 99.99% of network admins are building redundancy with the "Problem of ONE" baked in by design, then should we even worry?
NOTE: My ONE controller died because the night before I had deleted the NFS datastore by accident.