We do exactly as you do currently. Only a handful of devices are ever on the QA system. When a new update comes out, we test Deployment, Software Distribution, Patch and DTS. Once those are shown to work with no problems, we update PROD. I don't think it's worth the effort to test ALL your systems. That's a lot of work. And in our case, VMs really help with the testing regimen.
We're about to complete our SU3 update, since everything seemingly works on QA. Naturally, we sometimes pick up the pace of the QA-to-PROD phase if there are corrected defects we've been waiting on.
That's a somewhat "how long is a piece of string" question in a few ways. I'll try to share some of my experiences. I'm usually dealing with accounts that are anything between 10k - 100k nodes, but there's still a lot of "know thy environment" here. A well-disciplined structure is going to make life a lot easier compared to managing 1/10th of the same node-count in a "wild chaos jungle / IT anarchy" organisation.
Depending on how open / locked down your environments are, this can be an easier or a harder thing. Case in point:
- Dev is usually "super-open" as you're just doing PoC type stuff and trying to get things to work in the first place.
- REF environments (if you've got them) are a sensible thing to have (in my eyes), as you tend to have to put up with "the lockdowns" and so on of your prod environment. IDEALLY, the REF environment should be 100% identical to your PROD, but I've yet to see that happen anywhere.
- Certain things can be tricky to test / come across in either DEV or REF. Case in point - "load testing" ... I was at an account with 30k+ nodes, and we had developed something which was fine in DEV & REF ... but caused problems in PROD, because of the volume of data involved. The DEV/REF servers just didn't have the hardware specs to even remotely approximate / handle the quantity of data in live, so that's a risk you always need to look out for.
- *IF* you're in a lucky position and can have sufficient server resources in at least REF / QA to cope with a data-dump from live, that *CAN* be very beneficial. It doesn't need to be the same spec (since you're not going to be processing / querying the data in real-time to a similar extent) -- but it'll help you find early any situation where you'll be impacting your live server due to sheer quantity of data handling (be it intentional or not). Processing 5 scan files is quite different to processing 50k.
- In general, a "staged roll-out" in live is a sensible approach. By and large - the same as you would with a soft dist / patch package. You begin with "friendly targets" (your own team usually), then "competent, friendly relations" (friendly IT groups) ... and if everything is green, continue rolling out in stages. A good chunk of "surprise" will be dictated here by how homogeneous (or diverse) your hardware stack is.
- ... in related news, that's why I try to make friends with "the worst case scenario" people (i.e. - someone running a potato of a computer, behind a 36k modem, that's going up and down like a yo-yo) and bribe them (food is usually all it takes) into being a willing participant ... so that you have a good (and friendly!) test case of the "worst case scenario" before it gets to the other "worst case scenarios" out there. *THAT* specific inclusion of the "bad situation test" can be a huge help when trying out new features / ideas.
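The staged roll-out idea above can be sketched in a few lines of code. This is just a minimal illustration, not any specific product's API -- the wave names, the `deploy` / `failure_rate` callables, and the 2% threshold are all hypothetical stand-ins for whatever your tooling actually provides:

```python
# Hypothetical roll-out waves, ordered from "friendly targets" outward.
ROLLOUT_WAVES = [
    "own-team",            # friendly targets: your own team first
    "friendly-it-groups",  # competent, friendly relations
    "pilot-slice",         # a small slice of general production
    "all-remaining",       # the rest of the estate
]

# Assumed halting rule: stop if more than 2% of a wave reports problems.
FAILURE_THRESHOLD = 0.02


def staged_rollout(deploy, failure_rate):
    """Deploy wave by wave; halt at the first wave whose failure rate
    exceeds the threshold. Returns the list of waves that passed."""
    completed = []
    for wave in ROLLOUT_WAVES:
        deploy(wave)                 # push the package to this group
        rate = failure_rate(wave)    # check results before widening
        if rate > FAILURE_THRESHOLD:
            print(f"Halting roll-out: {wave} failed at {rate:.1%}")
            break
        completed.append(wave)
    return completed
```

The point of the structure is the gate between waves: if everything is green you keep widening, and a problem in the "pilot-slice" wave never reaches "all-remaining".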
... hope that helps somewhat?
If you've got other questions, feel free to shoot.