I recently held a session at Microsoft Campus Days 2012 introducing the use of Microsoft Data Quality Services (DQS) and Microsoft Master Data Services (MDS) for making life easier in the task of keeping your data neat and tidy for reporting. Session here (in Danish).
Here I will answer some of the questions that came up regarding the newest Microsoft offering: Data Quality Services.
- Where does DQS fit into a complex environment with data extraction from several source systems?
- Are you not supposed to correct your data back in the source system rather than changing it with a tool like DQS?
- Adding a tool like DQS between our source systems and data warehouse is a huge, risky and time-consuming task. How can we improve our data without starting a long and costly project?
- How can we automate the process of cleaning our data and avoid manual intervention?
If you are new to Data Quality Services, you might want to get an introduction first from my session above or from Microsoft’s own introduction.
Data Quality Services in a complex environment
A common issue in a large company is that the same type of information is kept in several different systems. Let us take customer data as an example. You might have a record for each customer in both the CRM system and the ERP system. Furthermore, you might have inherited customers from merging with or acquiring other companies. One of the features of DQS is that it helps you clean out duplicates, and you can set up rules for which records to keep. So you just plug DQS in between your source systems and the data warehouse or staging area, right?
No, unfortunately it is not that simple. Before you can start cleaning your data, you will need to answer some questions and make some preparations.
First, you will need to define a repository for your cleaned data: one containing all the attributes you need for your reporting and accessible to other systems that need the customer information. If you have a data warehouse, you will likely already have a customer dimension. Another approach is a dedicated master data repository as part of a master data system.
Secondly, you will need to find out which record is the correct one. Let us say you get customer data from the CRM system and the ERP system as mentioned above. If the records representing the same physical customer do not match, which one do you keep? In this case you might restructure your process so customers are created in only one of the systems and then loaded from the customer repository into the other. You can also merge the records, choosing the best value for each attribute based on rules in DQS. This is as much a business question as a technical one.
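To make the merging idea concrete, here is a minimal Python sketch of rule-based survivorship, i.e. picking the best value per attribute from two matching records. The field names and the rules (trust the CRM system for contact details, the ERP system for billing) are purely illustrative assumptions, not DQS itself:

```python
def merge_customer(crm: dict, erp: dict) -> dict:
    """Merge two records for the same physical customer, picking the
    'best' value per attribute according to simple business rules."""
    merged = {}
    # Illustrative rule: trust the CRM system for contact details,
    # falling back to the ERP value when the CRM field is empty.
    for field in ("name", "email", "phone"):
        merged[field] = crm.get(field) or erp.get(field) or ""
    # Illustrative rule: trust the ERP system for billing attributes.
    for field in ("billing_address", "payment_terms"):
        merged[field] = erp.get(field) or crm.get(field) or ""
    return merged

crm = {"name": "Contoso A/S", "email": "info@contoso.dk", "phone": ""}
erp = {"name": "Contoso", "email": "", "phone": "+45 12 34 56 78",
       "billing_address": "Main Street 1", "payment_terms": "Net 30"}

print(merge_customer(crm, erp))
```

In DQS you would express the same kind of preferences as domain and matching rules rather than code, but the business decision behind each rule is identical.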
Cleaning vs. correcting at the source
This brings me to the next question: why not just correct the wrong customer records back in the source systems? You are right, it will often be best practice to set up rules and checks in your source systems to avoid the bad records in the first place. Sometimes that is just not possible, for instance if you get your customer data from a third party, or if it comes from a web shop where the customers themselves enter the data. So even though it is always preferable to correct errors at the source, it is not always possible. It is always worth carefully considering the question of cleaning the data versus correcting it at the source. At first it can seem difficult to change a source system and much easier to implement a data cleansing tool, but the difference is that setting up checks or rules in the source system is often a one-time task, whereas cleaning the data involves ongoing maintenance.
An expensive and risky project?
Ok, so introducing Data Quality Services in your business will be expensive, there are tons of business questions to answer first, and you will need to completely restructure your way of doing things? The short answer is no. Data Quality Services comes out of the box with Microsoft SQL Server 2012, which means there is no need to invest in expensive extra software. Furthermore, you do not need to introduce DQS wall to wall at once. You can start with one dimension, or even a few attributes, to gain some experience and then expand the use to other areas. The DQS client is so easy and intuitive to work with that you do not need to know about development or SQL to build a knowledge base and clean your data. In fact, the recommendation is that the “data steward” you appoint as responsible for the knowledge base is more of a business representative than an IT one.
Automating knowledge base updates
Ok, so now you have created a knowledge base and a data cleansing process. How do you automate it so the knowledge base updates as time goes by? Well, you can automate your data cleansing process a long way. You can integrate it into your existing SQL Server Integration Services packages and automatically load the new records that fit your predefined rules. However, updating the knowledge base and handling the records rejected by those rules is a manual and ongoing process. It often needs a person who knows the business to make the right corrections; for instance, cleaning customer data might involve sales people who actually know the customers. That is also why the person appointed as “data steward”, or responsible for the knowledge base, should be someone outside the IT department.
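As a rough illustration of the automated part of such a pipeline, here is a small Python sketch that routes incoming records either to automatic loading or to manual review, depending on whether they pass predefined rules. The rules and field names are made up for the example; in a real setup this split would happen inside your SSIS packages:

```python
import re

# Illustrative validation rules: each maps a field name to a check.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "") is not None,
    "country": lambda v: v in {"DK", "SE", "NO"},
}

def route(records):
    """Split records into those that pass all rules (loaded automatically)
    and those rejected for manual review, with the failing fields noted."""
    accepted, rejected = [], []
    for rec in records:
        failures = [f for f, check in RULES.items() if not check(rec.get(f))]
        if failures:
            rejected.append((rec, failures))  # queued for the data steward
        else:
            accepted.append(rec)              # flows straight through
    return accepted, rejected

good = {"email": "info@contoso.dk", "country": "DK"}
bad = {"email": "not-an-email", "country": "DK"}
accepted, rejected = route([good, bad])
print(len(accepted), len(rejected))
```

The rejected pile is exactly the part that keeps requiring human attention: someone who knows the business has to decide what the correct values are, and those decisions are then fed back into the knowledge base.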
So far I have only touched on some of the questions that a new user might have regarding Microsoft Data Quality Services. If you have further questions about Data Quality Services, Master Data Services or data warehousing in general, you are more than welcome to ask. Platon A/S is the leading provider in Scandinavia of data warehouse, MDS and data cleansing solutions. We are independent of software vendors, so even though this post is about Microsoft Data Quality Services, we always base our recommendations on customer needs rather than a specific tool.