by Paul Segal, Teradata
In my first real-world job I was asked to develop an analytical model to measure the impact of closing some physical branches. The brief I was given was to build the model as quickly as possible and make it reasonably accurate, since it would only be run once or twice before being discarded.
So off I went, following the CRISP-DM methodology, and gathered the data. Then, as any good modeler should, I profiled the data so I could identify any problems that might exist with it. Once those problems were identified, which included a glaring ID10T (yes, that’s “idiot” in dev-speak) error on my part in extracting the wrong data, I set about the cleansing process, making the data fit for processing.
The next step was to process the data to get it ready for modelling, so off I went and joined datasets (not in a database, but on my local PC) and created a whole bunch of new variables that I thought would be useful for my model. This, of course, introduced a whole new set of data quality issues that required correcting.
In all, the above process of massaging my data into a form suitable for use in a model took me about a week. And then the fun part, the actual model building, started.
At this point, I want to share a bit of an aside: those of you who are in the advanced analytics game today and are under 35 years old, you have absolutely no idea how easy you have it when it comes to data preparation. Really, thank your lucky stars. I will explain further in a future article.
Back to model building. Like most data analysts, I went through many iterations of modelling the physical branch closures, each iteration requiring tweaks to the techniques (as well as the occasional switch in technique) and/or tweaks to the data – either the addition of new data (with the associated cleansing) or the creation of new variables (also with occasional cleansing required).
The iterations took the best part of another week, and finally I had a pretty good model.
So I could now run the required simulations. Basically, I fed data on specific branches into the model to understand the impact that closing them would have on the medium- and high-value customers, as well as on the most influential customers.
OK, easy enough. And the analytics all ran smoothly.
Eureka! I was happy. I was finally putting my 4 years of university training to use, and my first real-world analytical model was a success! The powers that be were happy.
But, little did I know, I had set myself up for one of the biggest (and most familiar) problems in analytics.
The powers that be were so happy with the performance of my simulation that they came back with more data, more branches, to run through my model, and then more, and then even more, and soon the majority of my time was being consumed by running branch closure simulations. This was not sustainable. I had to operationalise the model and shift the burden of running simulations away from me and onto the business.
Do you remember earlier in this article when I mentioned that I expected this model to be a one-off run? Given that, I naively extracted data without actually noting where I got it from. Nor had I noted any filtering I may have included when extracting the data. Likewise, I hadn’t kept any documentation of the cleansing I performed (including why I needed to cleanse in the first place). Nor the transformations I created. I also hadn’t followed any naming standards when creating my new variables.
Thus, I was ultimately left with an undocumented mess that I now had to operationalise. It took me 3 weeks to unravel the mess I had created for myself. Bear in mind that I was still a real-world analytics newbie.
But I can’t end there. The story does have a silver lining. After all of that often-unnecessary work, I now had a model that:
- followed naming standards in place at the institution
- had all source data identified and documented
- had all data quality issues identified and documented, along with the actions taken to correct the issues
- had all the transformations documented
- had the techniques used to build the model documented including any and all parameter settings
- had documentation describing the outputs generated by the model
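To make the checklist above concrete, here is a minimal sketch (the names, structure, and example data are purely illustrative, not the actual code from this story) of what documented, reproducible data preparation can look like: each step carries its naming-standard name, its data source, and the reason it exists, so the pipeline can be re-run and audited later instead of unravelled.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PrepStep:
    name: str              # follows a naming standard, e.g. cln_/drv_ prefixes
    source: str            # where the input data came from
    rationale: str         # the quality issue fixed or variable derived, and why
    func: Callable[[dict], dict]

def run_pipeline(rows, steps):
    """Apply each documented step in order, keeping an audit log."""
    log = []
    for step in steps:
        rows = [step.func(r) for r in rows]
        log.append(f"{step.name}: {step.rationale} (source: {step.source})")
    return rows, log

# Hypothetical example: fix missing branch revenue, then derive a value tier.
steps = [
    PrepStep(
        name="cln_revenue_nulls",
        source="branch_monthly extract",
        rationale="nulls in revenue break downstream scoring",
        func=lambda r: {**r, "revenue": r.get("revenue") or 0.0},
    ),
    PrepStep(
        name="drv_value_tier",
        source="cln_revenue_nulls output",
        rationale="model segments customers by value",
        func=lambda r: {**r, "value_tier": "high" if r["revenue"] > 1000 else "low"},
    ),
]

rows, audit = run_pipeline(
    [{"branch": "A", "revenue": None}, {"branch": "B", "revenue": 2500.0}],
    steps,
)
```

The point isn’t this particular structure; it’s that the source, the cleansing rationale, and the transformation live next to the code that performs them, so operationalising later is a re-read, not an archaeology dig.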
With all of this documentation in place, I was able to spend another short week operationalising the model (pushing the data preparation components back inside the database) and providing a front-end to the model (in Excel) so that the business could run the simulations themselves, using the most current data inside the database.
The moral of this story is that there is usually no such thing as a throwaway model. If a model is useful, then someone will want to operationalise it, so always build as if the model you are building will go into production, or be operationalised.
If it doesn’t happen, then you haven’t lost anything (other than maybe a couple of hours extra for the documentation), but if (or more likely when) it happens, you will be thankful that you took those extra couple of hours to document what you had done.