jump to navigation

Scrubbing the Data 04/07/2009

Posted by TBoehm30 in Data.
Tags: , , , ,
1 comment so far

Data needs to be ‘clean’ in a database to be worth anything. How can management trust their reports if their previous decisions turned out wrong based on bad data?

My favorite story on fixing the data was one from a seriously good techie named Mike. Every month or so, several departments would get called out based on the percentage of errors in the system. They got called on silly things like end dates before start dates, check boxes not checked when required, etc. These would have been easy to constrain if they had the ability to do that to the system. Unfortunately they were not allowed to make changes to their back end system.

Mike put in a new CRM system as a front end, and put in controls to prevent the users from entering ‘stupid’ data. This effectively gave them a 0% error rate. The next month Mike got called out because management didn’t believe the data. They assumed something was wrong with the data since it had no errors. That just wasn’t possible.

Not only was it possible, but it made Mike’s department the model for the rest of the company. His (Our) CRM system started to become the new front end for other departments that could use it. We made it a practice to put good business logic anywhere that data could get into our system.

The question then becomes: Where does it make the most sense to scrub the data? If you have control of 1 system, then any place where data goes into that system is an opportunity to clean the data. On user’s screens is the obvious location. Don’t let users enter mistakes into the system. Correct them before the data is saved. For example, if you have a dropdown choice, and require a checkbox for one of the choices, then put that in code. Even better, if the checkbox needs to be checked only 80% of the time, but half the time, the user forgets, then make sure they get a prompt reminding them of the checkbox.

Not so visible entry points include interfaces. Interfaces push, pull, send, or receive data from other systems. You may not have control over those other systems. Creating one set of APIs or controls is ideal. You should use the same set of requirements for any data coming into the system. Then you need a process where data can be fixed when it is found to be in error.

When an error is found, there are several options for what to do. You can route the data to a human, you could send an error back to the original system and not accept the data at all, or you could flag the data for future follow-up. What you shouldn’t do is allow the bad data to enter without any notice that it is wrong. You need your business logic to scrub the data for problems and inconsistencies.

If you have control over more than 1 system, then you may have more than just a few entry points. This project gets more difficult as you find more problem points with your systems. Obviously, any connection points between your systems need to be clean. Any outside connection to any of your systems need to have the data scrubbed.

Over the short run, having an end date before a begin date is not a big deal. In the long run, however, it can completely screw up an analysis. Not having the correct documentation for a sale could be the final straw pushing a manager’s dashboard control into the red and causing him to make bad decisions.

Business managers need to look at their simple errors and work with IT to improve their business logic. They need to be aware that it’s a global world out there and Technology makes it happen.