[plug] Systemd, good or bad

Mon Sep 22 11:59:11 UTC 2014

So you basically fire and forget, then chase down problems caused months ago, when you last made a change but either didn't bother to watch the console during the reboot or didn't bother to reboot at all after changing prod?  And you test on staging, but you rarely reboot and never look at what happened during boot, only afterwards if the box didn't come up?

> On 22 Sep 2014, at 19:30, Brad Campbell <brad at fnarfbargle.com> wrote:
> 
>> On 22/09/14 17:53, Hani Jabr wrote:
>> Fragile systems?  Heck yes. Not the base OS generally, but apps, hardware (how many rounds of firmware patches, inevitably Intel, so far this year?), disk and cluster migrations, app upgrades, OS upgrades.
> 
> I'm relatively lucky in that the stuff we work with is actually quite stable. We get to specify the software and hardware, and it gets built to our standards.
> 
>> Do you not run production systems?
> 
> Absolutely we do. We just don't do _fragile_.
> 
>> When you make a change, is it really acceptable for you to take time tracking down an issue caused by a typo or a misconfiguration 6 months or more ago?
> 
> Absolutely. In fact our clients rely on it. We are lucky enough to be able to spend days chasing down the smallest stability issue.
> 
>> Do you not test your changes to make sure the box will come up if it crashes or if you reboot it for some other reason?
> 
> Not on the production systems, no. Every change is meticulously tested in a staging environment (usually 2 or 3 times, to ensure the procedure is correct before deployment; the entire procedure is often scripted to preclude fat-finger errors). This means cloning live machines to ensure the data set and configuration are current (which can take quite some time on large arrays). Experimental code or systems can go through months of rigorous testing before they're certified for deployment.
> 
>> Do you not have change windows or someone wondering why stuff was down for longer than you said it would be?
> 
> We have "maintenance windows" if that's what you mean, but they never run over and are frequently faster than planned because all maintenance is pre-tested (see above).
> 
> We have had systems in WA sites that have gone 20 years without a reboot. Obviously not common, nor exposed to the outside world. But we do _reliable_ and mission critical.
> 
> I'm quite OCD about stability issues, so I've worked my way into a business that not only entertains my neuroses, but relies on them.
>