
Own your Data 11/29/2010

Posted by TBoehm30 in Data.

“There is too much data at my company to be useful.”  “The data we have is old and out-of-date.”  “Users around here don’t enter good data.”  How often are these statements said or thought?

Data is important and should be controlled properly.  The only way to ensure that you have good data is to make someone responsible for it, and have them own the data.  They need to use the data as well as have authority over the information put into the database.  Only when someone relies on data will they take an active part in making sure that the data is valid.

I once had a meeting with users who didn’t like the process of putting a caller’s first name and last name in the text boxes provided.  They wanted to put all of the caller’s information into the notes section.  I was supposed to figure out how to make that work.  You can’t report on notes, you can’t sort on them, and you can’t aggregate them; they are practically useless as a management tool.  I had to tell the users to get in line with the processes of the day and fill in the caller’s information in the proper location on the screen.

Do you have documentation on how data flows through your company?  Do you know where your data comes from?  How good is your data?  How often is it refreshed?  Do you have duplicate information?  What about duplicate sources of information?  If somebody comes to me and tells me that a report is wrong, I can trace it back to the source to figure out the problem.  Maybe he is looking at an old report, maybe the data is coming from the wrong columns, or maybe a data point or two is bad.  Whatever the problem, I know how to find the original data, determine the refresh rate, evaluate the quality of the data and explain any variance.

Once data gets old, it becomes difficult to validate.  You will need to de-dup your data, that is, find and remove any duplicates.  This is impossible if you don’t have good data.  How will you know if two entries with the same name are the same person or different people who just happen to have similar names?  I don’t know many people who can go through thousands of records to find duplicates and finish without going crazy.  If you find someone like that, keep them around.

Software exists that can do a lot of the validation for you.  You might have to give up control of your data to use it, or it might cost a lot of money.  The last time I looked at that kind of software, the vendors gave us percentage rates on the validity of the results.  I wouldn’t use the cheap stuff that had less than an 80% chance of being right.

I once worked with a guy who had to de-dup a huge database.  He explained that very few people with the same last name were born on the same day.  So, if you had their last name and birth date, you could be reasonably certain of duplicates.  Of course, twins and multiples are a problem.  In that case you add in the first name and it is extremely rare to find duplicates that were in fact more than one person.
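The matching rule he described can be sketched in a few lines of Python.  This is only an illustration, not his actual process: the field names (id, first_name, last_name, birth_date) and the sample records are made up, and the groups it returns are candidates for a person to review, not proven duplicates.

```python
from collections import defaultdict

def find_probable_duplicates(records):
    """Group record ids that share last name, birth date, and first name."""
    groups = defaultdict(list)
    for rec in records:
        # Normalize case and stray spaces so "LEE" and "Lee " match.
        key = (
            rec["last_name"].strip().lower(),
            rec["birth_date"],
            rec["first_name"].strip().lower(),
        )
        groups[key].append(rec["id"])
    # Only keys seen more than once are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]

# Made-up sample data: records 1 and 2 look like the same person,
# record 3 could be a twin with a different first name.
people = [
    {"id": 1, "first_name": "Ann", "last_name": "Lee", "birth_date": "1970-03-02"},
    {"id": 2, "first_name": "ann ", "last_name": "LEE", "birth_date": "1970-03-02"},
    {"id": 3, "first_name": "Amy", "last_name": "Lee", "birth_date": "1970-03-02"},
]
candidates = find_probable_duplicates(people)
```

Including the first name in the key is what keeps the twin in record 3 out of the candidate group.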

He told me about twins with the same first name, but different middle names.  Why would a parent do that?  He told me about a time when he had a name with two addresses and two Social Security Numbers that were one digit off.  He just knew that the SSN was a typo for one record, but he couldn’t prove it.  He had to list the two records as two people and it was bugging him.  So, he drove by the guy’s house to see if the addresses were correct (he wasn’t allowed to contact anyone because of privacy).  Sure enough the addresses were on a corner and looked to be the same house.  What could he do about it?  Nothing – he was not allowed to fix the data.
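A check for that "one digit off" suspicion might look like the sketch below.  The function names and the sample numbers are invented for illustration, and a match here proves nothing by itself; it only marks a pair of records as worth a second look by whoever owns the data.

```python
def digits_only(value):
    """Strip dashes and spaces so only the digits are compared."""
    return "".join(ch for ch in value if ch.isdigit())

def differs_by_one_digit(ssn_a, ssn_b):
    """True when two numbers have the same length and exactly one digit differs."""
    a, b = digits_only(ssn_a), digits_only(ssn_b)
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1

# Made-up numbers: same name on both records, last digit differs.
needs_review = differs_by_one_digit("123-45-6789", "123-45-6780")
```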

What sort of data issues do you wish you could fix at your company?  Do the right people own the data and take care of it?  Do you have governance to control the quality of the information?

Companies need to value their information, validate it often, and use it to their advantage.  After all, it’s a global world out there, and Technology makes it happen.

Scrubbing the Data 04/07/2009

Posted by TBoehm30 in Data.

Data needs to be ‘clean’ in a database to be worth anything. How can management trust their reports if their previous decisions turned out wrong based on bad data?

My favorite story on fixing the data was one from a seriously good techie named Mike. Every month or so, several departments would get called out based on the percentage of errors in the system. They got called on silly things like end dates before start dates, check boxes not checked when required, etc. These errors would have been easy to prevent with constraints in the system. Unfortunately, the departments were not allowed to make changes to their back end system.

Mike put in a new CRM system as a front end, and put in controls to prevent the users from entering ‘stupid’ data. This effectively gave them a 0% error rate. The next month Mike got called out because management didn’t believe the data. They assumed something was wrong with the data since it had no errors. That just wasn’t possible.

Not only was it possible, but it made Mike’s department the model for the rest of the company. His (our) CRM system started to become the new front end for other departments that could use it. We made it a practice to put good business logic anywhere that data could get into our system.

The question then becomes: Where does it make the most sense to scrub the data? If you have control of one system, then any place where data goes into that system is an opportunity to clean the data. The users’ screens are the obvious location. Don’t let users enter mistakes into the system. Correct them before the data is saved. For example, if you have a dropdown choice, and require a checkbox for one of the choices, then put that in code. Even better, if the checkbox needs to be checked only 80% of the time, but the user forgets it half the time, then make sure they get a prompt reminding them of the checkbox.
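Here is a rough sketch of that kind of screen-level check in Python. The field names (status, signed_off, follow_up) and the specific rules are hypothetical stand-ins, not taken from any real CRM; the point is the split between hard errors that block the save and reminders that only prompt the user.

```python
from datetime import date

def validate_entry(entry):
    """Return (errors, reminders) for one form entry before it is saved."""
    errors, reminders = [], []
    # Hard rule: a record cannot end before it starts.
    if entry["end_date"] < entry["start_date"]:
        errors.append("End date is before start date.")
    # Hard rule: this dropdown choice requires the checkbox (assumed rule).
    if entry["status"] == "Closed" and not entry["signed_off"]:
        errors.append("Status 'Closed' requires the sign-off box.")
    # Soft rule: usually checked, so remind the user instead of blocking.
    if not entry["follow_up"]:
        reminders.append("Follow-up is usually checked. Leave it blank?")
    return errors, reminders

errors, reminders = validate_entry({
    "start_date": date(2009, 4, 1),
    "end_date": date(2009, 3, 1),
    "status": "Closed",
    "signed_off": False,
    "follow_up": False,
})
```

A screen would refuse to save while errors is non-empty, and would show the reminders as a confirmation prompt.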

Less visible entry points include interfaces. Interfaces push, pull, send, or receive data from other systems. You may not have control over those other systems. Creating one set of APIs or controls is ideal. You should use the same set of requirements for any data coming into the system. Then you need a process where data can be fixed when it is found to be in error.

When an error is found, there are several options for what to do. You can route the data to a human, you could send an error back to the original system and not accept the data at all, or you could flag the data for future follow-up. What you shouldn’t do is allow the bad data to enter without any notice that it is wrong. You need your business logic to scrub the data for problems and inconsistencies.
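Those options can be expressed as a single check that every interface calls. The sketch below is one possible shape, not a prescription: the field names and the choice of which problem gets which treatment are assumptions made for the example, and dates are assumed to be ISO strings (YYYY-MM-DD) so that plain string comparison orders them correctly.

```python
from enum import Enum

class Disposition(Enum):
    ACCEPT = "accept"
    REJECT = "send an error back to the source system"
    REVIEW = "route to a human"
    FLAG = "flag for future follow-up"

def check_incoming(record):
    """Decide what to do with one incoming record; returns (disposition, reason)."""
    # Unusable without an owner: refuse it outright (assumed policy).
    if not record.get("customer_id"):
        return Disposition.REJECT, "missing customer_id"
    # Inconsistent dates: a person has to decide which one is right.
    if record["end_date"] < record["start_date"]:
        return Disposition.REVIEW, "end date before start date"
    # Incomplete but usable: let it in, marked for follow-up.
    if not record.get("documentation"):
        return Disposition.FLAG, "no documentation attached"
    return Disposition.ACCEPT, ""

result = check_incoming({
    "customer_id": "C-100",
    "start_date": "2009-04-01",
    "end_date": "2009-03-01",
    "documentation": "contract.pdf",
})
```

Whatever the policy, every path returns a reason, so nothing bad gets in without notice.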

If you have control over more than one system, then you may have more than just a few entry points. This project gets more difficult as you find more problem points with your systems. Obviously, any connection points between your systems need to be clean. Any outside connection to any of your systems needs to have the data scrubbed.

Over the short run, having an end date before a begin date is not a big deal. In the long run, however, it can completely screw up an analysis. Not having the correct documentation for a sale could be the final straw pushing a manager’s dashboard control into the red and causing him to make bad decisions.

Business managers need to look at their simple errors and work with IT to improve their business logic. They need to be aware that it’s a global world out there and Technology makes it happen.