Automated Data Validation, Data Quality and Unit Test Case Framework

There are many cases where, after data loads complete, certain data validation and quality checks need to be run to ensure we are delivering quality data to reporting, decision-support, and analytics teams. Doing this manually is a tedious job, and buying a data quality tool for it means extra dollars spent that don't add much value to the end product.

Here, using a simple table and stored procedures, we will design a simple yet powerful application that validates data and provides complete details about any data inconsistency in your application. There is also a feature that runs data correction scripts to fix a data issue well before some downstream application consumes it. The complete setup is reusable and can be adapted to any relational database platform.

Design:
We need three objects to achieve this process:

  1. Metadata table: stores all the validation scripts and the correction scripts to be executed on a data issue, along with the execution date and the email addresses to notify when a data issue is found.
  2. Stored procedure to execute each test case: executes a single test case, captures the result, and takes the necessary action to correct any data discrepancy.
  3. Stored procedure to execute in batch: there will be n test cases per table, and executing each one manually is tedious, so this proc dynamically executes all the test cases matching the input arguments passed to it.

Let's jump into the code now.

Here I will store all my metadata in one table called Table_Validation_Script


USE TESTDB
GO

IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE type = 'U'
      AND name = 'Table_Validation_Script'
)
    DROP TABLE Table_Validation_Script
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].Table_Validation_Script (
    [UT_NO]        BIGINT IDENTITY(1,1) NOT NULL,
    [UT_DESC]      [nvarchar](max) NULL,
    [DB_NM]        [nvarchar](100) NOT NULL,
    [SCHEMA_NM]    [nvarchar](100) NOT NULL,
    [TABLE_NM]     [nvarchar](100) NULL,
    [UT_SQL]       [nvarchar](max) NULL,
    [FX_SQL]       [nvarchar](max) NULL,
    [DL]           [nvarchar](100) NULL,
    [RESULT]       [nvarchar](4) DEFAULT ('FAIL'),
    [EXEC_DT]      [datetime2](3) DEFAULT ('1900-01-01'),
    [DW_TIMESTAMP] [datetime2](3),
    [CREATED_BY]   [nvarchar](50),
    CONSTRAINT [PK_Table_Validation_Script] PRIMARY KEY CLUSTERED
    (
        [UT_NO] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [PRIMARY]
) ON [PRIMARY]


The resulting table layout looks like the screenshot below.


Design your scripts so that UT_SQL returns no rows on success; if there are data issues, it should return the offending records in its output.

A sample query for UT_SQL is below.

Case: check for duplicates in the table

SELECT testcol1, 
       testcol2, 
       Count(*) 
FROM   testdb.dbo.testtable WITH (nolock) 
GROUP  BY testcol1, 
          testcol2 
HAVING Count(*) > 1 

The FX_SQL design is simple: either a T-SQL statement or a procedure that can be executed to fix the data issue. In the case above, having found duplicates, we need to delete those duplicate records from the table, so below is a sample fix SQL for it.


-- Note: NOLOCK is not allowed on the target table of a DELETE, so the CTE reads without a hint
WITH dup 
     AS (SELECT testcol1, 
                testcol2, 
                Row_number() 
                  OVER ( 
                    partition BY testcol1, testcol2 
                    ORDER BY dw_timestamp) AS RNM 
         FROM   testdb.dbo.testtable) 
DELETE FROM dup 
WHERE  rnm > 1 
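With both scripts in hand, the test case can be registered in the metadata table. A sketch of the INSERT is below; the description, notification address, and CREATED_BY value are illustrative:

```sql
-- Register the duplicate-check test case (values are illustrative)
INSERT INTO TESTDB.dbo.Table_Validation_Script
    (UT_DESC, DB_NM, SCHEMA_NM, TABLE_NM, UT_SQL, FX_SQL, DL, DW_TIMESTAMP, CREATED_BY)
VALUES
    ('Check duplicates in testtable',
     'TESTDB', 'dbo', 'testtable',
     -- UT_SQL: returns no rows on success, offending rows on failure
     N'SELECT testcol1, testcol2, COUNT(*) FROM testdb.dbo.testtable WITH (NOLOCK) GROUP BY testcol1, testcol2 HAVING COUNT(*) > 1',
     -- FX_SQL: deletes the duplicate rows, keeping the earliest by dw_timestamp
     N'WITH dup AS (SELECT testcol1, testcol2, ROW_NUMBER() OVER (PARTITION BY testcol1, testcol2 ORDER BY dw_timestamp) AS RNM FROM testdb.dbo.testtable) DELETE FROM dup WHERE RNM > 1',
     'support-team@example.com',   -- illustrative distribution list
     GETDATE(),
     'author');
```

RESULT and EXEC_DT are left to their defaults; the execution procedure updates them on each run.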


Now that the metadata table is ready, let's run these scripts in an automated fashion via a simple stored procedure execution.

The stored procedure to execute the SQL is below. This proc executes a single test case: on success it updates the status in the validation table to PASS; on failure it creates a .csv file of the offending records, sends an email to the support team about the failure, and runs the fix SQL to correct the data issue. The script is largely self-explanatory.


USE TESTDB
GO

IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE type = 'P'
      AND name = 'udsp_EXEC_VALIDATION_SCRIPT'
)
    DROP PROCEDURE udsp_EXEC_VALIDATION_SCRIPT
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_WARNINGS ON
GO
SET ANSI_NULLS ON
GO
/*
CREATED BY: Niranjan Raikar
CREATED ON: 05/26/2019
DESCRIPTION: Runs the validation script; on failure, sends an email to the concerned team about the failure
and runs the fix script that corrects the data issue.

NOTE: Insert your validation and fix SQL into TESTDB.dbo.Table_Validation_Script

USAGE:
/*********************************************************************************************************************************/
DECLARE @1 BIGINT
DECLARE @2 NVARCHAR(MAX)
DECLARE @3 NVARCHAR(MAX)
DECLARE @4 NVARCHAR(MAX)
DECLARE @5 NVARCHAR(MAX)
DECLARE @6 NVARCHAR(MAX)
DECLARE @7 NVARCHAR(MAX)
DECLARE @8 NVARCHAR(100)
SELECT
    @1 = UT_NO,
    @2 = UT_DESC,
    @3 = DB_NM,
    @4 = SCHEMA_NM,
    @5 = TABLE_NM,
    @6 = UT_SQL,
    @7 = FX_SQL,
    @8 = DL
FROM TESTDB.dbo.Table_Validation_Script WITH (NOLOCK)
WHERE UT_NO = 26

EXEC udsp_EXEC_VALIDATION_SCRIPT @1, @2, @3, @4, @5, @6, @7, @8
/*********************************************************************************************************************************/
*/
CREATE PROC [dbo].udsp_EXEC_VALIDATION_SCRIPT
    @UT_NO       BIGINT,
    @UT_DESC     NVARCHAR(MAX),
    @DB_NAME     NVARCHAR(100),
    @SCHEMA_NAME NVARCHAR(100),
    @TBL_NAME    NVARCHAR(100),
    @UT_SQL      NVARCHAR(MAX),
    @FX_SQL      NVARCHAR(MAX),
    @DL          NVARCHAR(100)
AS
BEGIN
    -- Strip the instance name; fall back to the full name for a default instance (no '\')
    DECLARE @SERVERNM VARCHAR(200) = CASE
        WHEN CHARINDEX('\', @@SERVERNAME) > 0
            THEN SUBSTRING(@@SERVERNAME, 1, CHARINDEX('\', @@SERVERNAME) - 1)
        ELSE @@SERVERNAME END
    DECLARE @bcpCommand VARCHAR(8000);
    DECLARE @ERROR_FNAME NVARCHAR(512) = 'S:\datawarehouse\Validation_Output\';
    DECLARE @DEL_FNAME NVARCHAR(1024);
    DECLARE @FAILSUB NVARCHAR(MAX) = '['+@SERVERNM+']'+'-['+@DB_NAME+'].['+@SCHEMA_NAME+'].['+@TBL_NAME+']-[UT_NO:'+CAST(@UT_NO AS NVARCHAR(30))+']-'+'FAILED'
    DECLARE @BDY NVARCHAR(MAX) = '/*---Below validation script has failed---*/' + CHAR(10)+CHAR(10) + @UT_SQL + CHAR(10)+CHAR(10) + '/*---Executed the below fix SQL to correct the data issue---*/' + CHAR(10)+CHAR(10) + @FX_SQL + CHAR(10)+CHAR(10) + '/*---Please find the attached error output---*/'

    EXEC sp_executesql @UT_SQL
    -- @@ROWCOUNT reflects the SELECT inside sp_executesql; non-zero means the check found bad records
    IF @@ROWCOUNT <> 0
    BEGIN
        UPDATE Table_Validation_Script
        SET RESULT = 'FAIL',
            EXEC_DT = GETDATE()
        WHERE UT_NO = @UT_NO

        SET @ERROR_FNAME = @ERROR_FNAME+@DB_NAME+'_'+@TBL_NAME+'_'+'UT_NO_'+CAST(@UT_NO AS NVARCHAR(30))+'_'
        -- Remove any earlier output files for this test case
        SET @DEL_FNAME = 'IF EXIST "'+@ERROR_FNAME+'*.csv" DEL /Q "'+@ERROR_FNAME+'*.csv"'
        EXEC master..xp_cmdshell @DEL_FNAME

        DECLARE @SQL NVARCHAR(MAX)
        SET @SQL = REPLACE(@UT_SQL, CHAR(13)+CHAR(10), ' ');  -- bcp needs the query on a single line
        SET @ERROR_FNAME = @ERROR_FNAME + CONVERT(VARCHAR(20), GETDATE(), 112) + '.csv'
        SET @bcpCommand = 'bcp "' + @SQL + '" queryout ' + @ERROR_FNAME + ' -S ' + @SERVERNM + ' -T -c'
        EXEC master..xp_cmdshell @bcpCommand

        PRINT CHAR(10) + '[UT_NO:'+CAST(@UT_NO AS NVARCHAR(30))+'] ' + @UT_DESC + CHAR(10)
        PRINT @UT_SQL + CHAR(10)+CHAR(10) + 'Validation Script Failed'
        EXEC msdb.dbo.sp_send_dbmail
            @profile_name = 'email profile',
            @recipients = @DL,
            @subject = @FAILSUB,
            @file_attachments = @ERROR_FNAME,
            @body = @BDY;
        EXEC sp_executesql @FX_SQL
    END
    ELSE
    BEGIN
        UPDATE Table_Validation_Script
        SET RESULT = 'PASS',
            EXEC_DT = GETDATE()
        WHERE UT_NO = @UT_NO
        PRINT CHAR(10) + '[UT_NO:'+CAST(@UT_NO AS NVARCHAR(30))+'] ' + @UT_DESC + CHAR(10)
        PRINT @UT_SQL + CHAR(10)+CHAR(10) + 'Validation Script Succeeded'
    END
    -- Keep only the last 90 days of output files
    EXEC xp_cmdshell 'FORFILES /p S:\datawarehouse\Validation_Output /s /m *.csv /d -90 /c "CMD /C del /Q /F @FILE"'
END

Now we have a process that can run the UT test cases for us, but a given table might have 10, 15, or many more validation scripts, and we can't execute each one manually. So the next procedure executes all the test cases for the database name and table name passed as input arguments.

The stored procedure for batch execution is below; it too is largely self-explanatory.

USE TESTDB
GO

IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE type = 'P'
      AND name = 'udsp_BATCH_EXEC_VALIDATION_SCRIPT'
)
    DROP PROCEDURE udsp_BATCH_EXEC_VALIDATION_SCRIPT
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_WARNINGS ON
GO
SET ANSI_NULLS ON
GO
/*
CREATED BY: Niranjan Raikar
CREATED ON: 05/28/2019
DESCRIPTION: Runs all the validation queries for the database and table names passed as input arguments, in batch mode.

USAGE:
EXEC udsp_BATCH_EXEC_VALIDATION_SCRIPT 'TESTDB','TESTTABLE'
*/
CREATE PROC [dbo].udsp_BATCH_EXEC_VALIDATION_SCRIPT
    @DB_NM    NVARCHAR(100),
    @TABLE_NM NVARCHAR(100)
AS

DECLARE @COUNTER INT;
DECLARE @LOOP INT;
DECLARE @FIN_RESULT INT;
DECLARE @1 BIGINT
DECLARE @2 NVARCHAR(MAX)
DECLARE @3 NVARCHAR(MAX)
DECLARE @4 NVARCHAR(MAX)
DECLARE @5 NVARCHAR(MAX)
DECLARE @6 NVARCHAR(MAX)
DECLARE @7 NVARCHAR(MAX)
DECLARE @8 NVARCHAR(100)

-- Number the matching test cases so we can walk them one by one
DROP TABLE IF EXISTS #VAL_SQL
SELECT UT_NO, ROW_NUMBER() OVER (ORDER BY UT_NO) AS RNM
INTO #VAL_SQL
FROM TESTDB.dbo.Table_Validation_Script
WHERE DB_NM = @DB_NM
  AND TABLE_NM = @TABLE_NM

SELECT @COUNTER = MAX(RNM),
       @LOOP = MIN(RNM)
FROM #VAL_SQL

WHILE (@LOOP <= @COUNTER)
BEGIN
    SELECT
        @1 = UT_NO,
        @2 = UT_DESC,
        @3 = DB_NM,
        @4 = SCHEMA_NM,
        @5 = TABLE_NM,
        @6 = UT_SQL,
        @7 = FX_SQL,
        @8 = DL
    FROM TESTDB.dbo.Table_Validation_Script WITH (NOLOCK)
    WHERE UT_NO = (SELECT UT_NO FROM #VAL_SQL WHERE RNM = @LOOP)

    EXEC udsp_EXEC_VALIDATION_SCRIPT @1, @2, @3, @4, @5, @6, @7, @8
    SET @LOOP = @LOOP + 1
END

-- Fail the batch if any test case in this run is still marked FAIL
SELECT @FIN_RESULT = COUNT(*)
FROM TESTDB.dbo.Table_Validation_Script V WITH (NOLOCK)
INNER JOIN #VAL_SQL S ON S.UT_NO = V.UT_NO
WHERE V.RESULT = 'FAIL';

IF @FIN_RESULT <> 0
    THROW 50000, 'Validation did not succeed', 1;
GO

This is how it works, as shown in the screenshot below.
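Concretely, a full run is a single call, and the Messages pane reports each test case with its result (output shape illustrative, following the PRINT statements in the execution proc):

```sql
-- Run every registered test case for one table
EXEC udsp_BATCH_EXEC_VALIDATION_SCRIPT 'TESTDB', 'TESTTABLE'

-- The Messages pane then shows something along these lines:
-- [UT_NO:1] Check duplicates in testtable
-- SELECT testcol1, testcol2, COUNT(*) ...
-- Validation Script Succeeded
```

If any test case in the batch fails, the final THROW surfaces the error, which lets a calling job (such as a SQL Agent step) fail visibly instead of succeeding silently.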


So by using these three objects you can automate the execution of any number of test cases; this helps you automate the data validation process, auto-fix data issues, and report failures to the concerned team.

TIP 1: UT_SQL and FX_SQL are entirely your responsibility; you can make them as complex as you need. In my case the fix SQL didn't fit in the table, so I created reusable procedures for data correction and called them from FX_SQL.
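One way to do that, as a sketch (the procedure name and UT_NO value here are illustrative): wrap the correction logic in a stored procedure, then store only the EXEC call in FX_SQL, since sp_executesql in the runner executes it just like inline SQL.

```sql
-- Hypothetical reusable fix procedure for the duplicate case
CREATE PROC [dbo].udsp_FIX_TESTTABLE_DUPS
AS
BEGIN
    WITH dup AS (
        SELECT testcol1, testcol2,
               ROW_NUMBER() OVER (PARTITION BY testcol1, testcol2
                                  ORDER BY dw_timestamp) AS RNM
        FROM testdb.dbo.testtable
    )
    DELETE FROM dup WHERE RNM > 1
END
GO

-- The metadata row then holds just the call, however long the fix logic is
UPDATE TESTDB.dbo.Table_Validation_Script
SET FX_SQL = N'EXEC testdb.dbo.udsp_FIX_TESTTABLE_DUPS'
WHERE UT_NO = 1   -- illustrative test-case number
```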

TIP 2: When there were many error records, the error file grew large and I could not send it as an attachment due to restrictions on the email profile, so I suggest using limits such as TOP 100 when creating UT_SQL.
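For example, the duplicate check from earlier could be capped like this so the .csv attachment stays small; the validation still fails on any returned row, only the sample sent by email is limited:

```sql
SELECT TOP 100
       testcol1,
       testcol2,
       COUNT(*) AS DUP_CNT
FROM   testdb.dbo.testtable WITH (NOLOCK)
GROUP  BY testcol1, testcol2
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC   -- surface the worst offenders first
```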









