Automated Data Validation, Data Quality and Unit Test Case Framework
There are many cases where, after data loads complete, certain data validation and quality checks need to be run to ensure we deliver quality data to reporting, decision-support, and analytics teams. Doing this manually is a tedious job, and buying a data quality tool for it is extra money spent that doesn't add much value to the end product.
Here, using a simple table and a couple of stored procedures, we design a simple yet powerful application that validates data and reports every inconsistency in your application in detail. There is also a feature that runs data correction scripts to fix issues well before any downstream application consumes the data. The complete setup is reusable and can be adapted to any relational database platform.
Design:
We need three object to achieve this process
- Metadata Table: stores all the validation scripts, the correction scripts to execute when a data issue is found, execution dates, and the email addresses to notify on data issues.
- Stored procedure to execute each test case: runs a single test case, captures the result, and takes the necessary actions to correct any data discrepancy.
- Stored procedure to execute in batch: there can be any number of test cases per table, and executing each one manually is tedious, so this procedure dynamically executes all the test cases matching the input arguments passed to it.
Let's jump into the code now.
I will store all the metadata in one table called Table_Validation_Script.
USE TESTDB
GO
IF EXISTS (
SELECT *
FROM SYS.OBJECTS
WHERE TYPE = 'U'
AND NAME = 'Table_Validation_Script'
)
DROP TABLE Table_Validation_Script
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].Table_Validation_Script(
[UT_NO] BIGINT NOT NULL IDENTITY(1,1),
[UT_DESC] [nvarchar](max) NULL,
[DB_NM] [nvarchar](100) NOT NULL,
[SCHEMA_NM] [nvarchar](100) NOT NULL,
[TABLE_NM] [nvarchar](100) NULL,
[UT_SQL] [nvarchar](MAX) NULL,
[FX_SQL] [nvarchar](MAX) NULL,
[DL] [nvarchar](100) NULL,
[RESULT] [nvarchar](4) default ('FAIL'),
[EXEC_DT] [DATETIME2](3) DEFAULT ('1900-01-01'),
[DW_TIMESTAMP] [DATETIME2](3),
[CREATED_BY] [nvarchar](50),
CONSTRAINT [PK_Table_Validation_Script] PRIMARY KEY CLUSTERED
(
[UT_NO] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [PRIMARY]
) ON [PRIMARY]
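To register a test case, insert one row per validation into the metadata table. A sample row for the duplicate-check case shown further below might look like this (the table, columns, and email address are illustrative only):

```sql
INSERT INTO dbo.Table_Validation_Script
    (UT_DESC, DB_NM, SCHEMA_NM, TABLE_NM, UT_SQL, FX_SQL, DL, DW_TIMESTAMP, CREATED_BY)
VALUES
    ('Check duplicates in testtable',                -- human-readable test description
     'TESTDB', 'dbo', 'testtable',
     N'SELECT testcol1, testcol2, COUNT(*)
       FROM testdb.dbo.testtable WITH (NOLOCK)
       GROUP BY testcol1, testcol2
       HAVING COUNT(*) > 1',                         -- UT_SQL: returns rows only on failure
     N'WITH dup AS (SELECT testcol1, testcol2,
                           ROW_NUMBER() OVER (PARTITION BY testcol1, testcol2
                                              ORDER BY dw_timestamp) AS RNM
                    FROM testdb.dbo.testtable)
       DELETE FROM dup WHERE RNM > 1',               -- FX_SQL: correction script
     'support-team@example.com',                     -- DL: notification list
     GETDATE(), 'niranjan')
```

RESULT and EXEC_DT are left to their defaults; the execution procedure updates them on every run.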
Design your UT_SQL so that it returns no rows on success; when there are data issues, it should return the offending records in its output.
A sample UT_SQL query is below.
Case: Check duplicates in the table
SELECT testcol1,
testcol2,
Count(*)
FROM testdb.dbo.testtable WITH (nolock)
GROUP BY testcol1,
testcol2
HAVING Count(*) > 1
The FX_SQL design is simple: either a T-SQL statement or a procedure that can be executed to fix the data issue. In the case above, once duplicates are found we need to delete the duplicate records from the table, so a sample fix SQL is below.
WITH dup
AS (SELECT testcol1,
testcol2,
Row_number()
OVER (
partition BY testcol1, testcol2
ORDER BY dw_timestamp) AS RNM
-- no NOLOCK here: the hint is not allowed on the target table of a DELETE
FROM testdb.dbo.testtable)
DELETE FROM dup
WHERE rnm > 1
Now that the metadata table is ready, let's run these scripts in an automated fashion via a simple stored procedure call.
The stored procedure to execute the SQL is below. It runs a single test case; on success it updates the status in the validation table to PASS, and on failure it writes the offending records to a .csv file, emails the support team about the failure, and runs the fix SQL to correct the data issue.
USE TESTDB
GO
IF EXISTS (
SELECT *
FROM SYS.OBJECTS
WHERE TYPE = 'P'
AND NAME = 'udsp_EXEC_VALIDATION_SCRIPT'
)
DROP PROCEDURE udsp_EXEC_VALIDATION_SCRIPT
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_WARNINGS ON
GO
SET ANSI_NULLS ON
GO
/*
CREATED BY: Niranjan Raikar
CREATED ON: 05/26/2019
DESCRIPTION: The sp runs the validation script; on failure it sends an email to the concerned team about the failure
and includes the fix script that was executed to correct the data issue
NOTE: Insert your validation and fix sql in TESTDB.dbo.Table_Validation_Script
USAGE:
/*********************************************************************************************************************************/
DECLARE @1 bigint
DECLARE @2 NVARCHAR(MAX)
DECLARE @3 NVARCHAR(MAX)
DECLARE @4 NVARCHAR(MAX)
DECLARE @5 NVARCHAR(MAX)
DECLARE @6 NVARCHAR(MAX)
DECLARE @7 NVARCHAR(MAX)
DECLARE @8 NVARCHAR(100)
select
@1 = UT_NO,
@2 = UT_DESC,
@3 = DB_NM,
@4 = SCHEMA_NM,
@5 = TABLE_NM,
@6 = UT_SQL,
@7 = FX_SQL,
@8 = DL
from TESTDB.dbo.Table_Validation_Script with (nolock)
WHERE UT_NO = 26
EXEC udsp_EXEC_VALIDATION_SCRIPT @1, @2, @3, @4, @5, @6, @7, @8
/*********************************************************************************************************************************/
*/
CREATE PROC [dbo].udsp_EXEC_VALIDATION_SCRIPT @UT_NO BIGINT,@UT_DESC NVARCHAR(MAX),@DB_NAME NVARCHAR(100),@SCHEMA_NAME NVARCHAR(100),@TBL_NAME NVARCHAR(100),@UT_SQL NVARCHAR(MAX), @FX_SQL NVARCHAR(MAX), @DL nvarchar(100)
AS
BEGIN
DECLARE @SERVERNM VARCHAR(200) = CASE WHEN CHARINDEX('\', @@SERVERNAME) > 0
                                      THEN SUBSTRING(@@SERVERNAME, 1, CHARINDEX('\', @@SERVERNAME) - 1)
                                      ELSE @@SERVERNAME END  -- handle default instances with no '\'
DECLARE @bcpCommand varchar(8000);
DECLARE @ERROR_FNAME NVARCHAR(500) = 'S:\datawarehouse\Validation_Output\';
DECLARE @DEL_FNAME NVARCHAR(500);
DECLARE @FAILSUB NVARCHAR(MAX) = '['+@SERVERNM+']' +'-['+@DB_NAME+'].['+@SCHEMA_NAME+'].['+@TBL_NAME+']-[UT_NO:'+CAST(@UT_NO AS nvarchar(30))+']-'+'FAILED'
DECLARE @BDY NVARCHAR(MAX) = '/*---Below validation script has failed---*/' + CHAR(10) +CHAR(10)+ @UT_SQL + CHAR(10) + CHAR(10) +'/*---Executed the below fix SQL to correct the data issue---*/' + CHAR(10)+CHAR(10) + @FX_SQL + CHAR(10)+CHAR(10) + '/*---Please find attached error output---*/'
EXEC sp_executesql @UT_SQL
IF @@ROWCOUNT <> 0
BEGIN
UPDATE Table_Validation_Script
SET RESULT = 'FAIL',
EXEC_DT = GETDATE()
WHERE UT_NO = @UT_NO
SET @ERROR_FNAME = @ERROR_FNAME+@DB_NAME+'_'+@TBL_NAME+'_'+'UT_NO_'+CAST(@UT_NO AS nvarchar(30))+'_'
SET @DEL_FNAME = 'if exist '+@ERROR_FNAME+'*.csv DEL '+@ERROR_FNAME+'*.csv'
EXEC master..xp_cmdshell @DEL_FNAME
DECLARE @SQL NVARCHAR(MAX)
Set @SQL = Replace(@UT_SQL,CHAR(13)+CHAR(10),' ');
SET @ERROR_FNAME = @ERROR_FNAME + convert(varchar(20),getdate(),112)+'.csv'
SET @bcpCommand = 'bcp "' + @SQL +'" queryout ' + @ERROR_FNAME + ' -S '+@SERVERNM+' -T -c'
EXEC master..xp_cmdshell @bcpCommand
PRINT CHAR(10) + '[UT_NO:'+CAST(@UT_NO AS nvarchar(30))+'] ' + @UT_DESC + CHAR(10)
PRINT @UT_SQL + CHAR(10) + CHAR(10) + 'Validation Script Failed'
EXEC msdb.dbo.sp_send_dbmail
@profile_name='email profile',
@recipients=@DL,
@subject=@FAILSUB,
@file_attachments=@ERROR_FNAME,
@body = @BDY;
EXEC sp_executesql @FX_SQL
END
ELSE
BEGIN
UPDATE Table_Validation_Script
SET RESULT = 'PASS',
EXEC_DT = GETDATE()
WHERE UT_NO = @UT_NO
PRINT CHAR(10) + '[UT_NO:'+CAST(@UT_NO AS nvarchar(30))+'] ' + @UT_DESC + CHAR(10)
PRINT @UT_SQL + CHAR(10) + CHAR(10) + 'Validation Script Succeeded'
END
-- To keep only last 90 days of files
EXEC xp_cmdshell 'FORFILES /p S:\datawarehouse\Validation_Output /s /m *.csv /d -90 /c "CMD /C del /Q /F @FILE"'
END
Now we have a process that can run the UT test cases for us, but a given table might have 10, 15, or many more validation scripts, and we can't execute each one manually. So the next procedure executes all the test cases for the database and table names passed as input arguments.
The stored procedure to execute in bulk is below.
USE TESTDB
GO
IF EXISTS (
SELECT *
FROM SYS.OBJECTS
WHERE TYPE = 'P'
AND NAME = 'udsp_BATCH_EXEC_VALIDATION_SCRIPT'
)
DROP PROCEDURE udsp_BATCH_EXEC_VALIDATION_SCRIPT
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_WARNINGS ON
GO
SET ANSI_NULLS ON
GO
/*
CREATED BY: Niranjan Raikar
CREATED ON: 05/28/2019
DESCRIPTION: The sp runs all the validation queries in batch mode for the database and table names passed as input arguments
USAGE:
EXEC udsp_BATCH_EXEC_VALIDATION_SCRIPT 'TESTDB','TESTTABLE'
*/
CREATE PROC [dbo].udsp_BATCH_EXEC_VALIDATION_SCRIPT @DB_NM NVARCHAR(100),@TABLE_NM NVARCHAR(100)
AS
DECLARE @COUNTER INT;
DECLARE @LOOP INT
DECLARE @FIN_RESULT INT
DROP TABLE IF EXISTS #VAL_SQL
SELECT UT_NO,ROW_NUMBER() OVER (ORDER BY UT_NO) AS RNM
INTO #VAL_SQL
FROM TESTDB.dbo.Table_Validation_Script
WHERE DB_NM = @DB_NM
AND TABLE_NM = @TABLE_NM
SELECT @COUNTER = MAX(RNM),
@LOOP = MIN(RNM) FROM #VAL_SQL
WHILE (@LOOP <= @COUNTER)
BEGIN
DECLARE @1 bigint
DECLARE @2 NVARCHAR(MAX)
DECLARE @3 NVARCHAR(MAX)
DECLARE @4 NVARCHAR(MAX)
DECLARE @5 NVARCHAR(MAX)
DECLARE @6 NVARCHAR(MAX)
DECLARE @7 NVARCHAR(MAX)
DECLARE @8 NVARCHAR(100)
select
@1 = UT_NO,
@2 = UT_DESC,
@3 = DB_NM,
@4 = SCHEMA_NM,
@5 = TABLE_NM,
@6 = UT_SQL,
@7 = FX_SQL,
@8 = DL
from TESTDB.dbo.Table_Validation_Script with (nolock)
WHERE UT_NO = (SELECT UT_NO FROM #VAL_SQL WHERE RNM = @LOOP)
EXEC udsp_EXEC_VALIDATION_SCRIPT @1, @2, @3, @4, @5, @6, @7, @8
SET @LOOP = @LOOP + 1
END
SELECT @FIN_RESULT = COUNT(V.RESULT)
FROM TESTDB.dbo.Table_Validation_Script V WITH (NOLOCK)
INNER JOIN #VAL_SQL S ON S.UT_NO = V.UT_NO
WHERE V.RESULT = 'FAIL';  -- the semicolon is required before THROW
IF @FIN_RESULT <> 0
THROW 50000, 'Validation did not Succeed', 1;
GO
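Because the batch procedure throws on any failed test case, it can be wired into a post-load job step so the load itself is flagged when validation fails. A minimal sketch (the database and table names are the sample ones used above):

```sql
BEGIN TRY
    -- Run all validations registered for TESTDB.TESTTABLE after the load completes
    EXEC TESTDB.dbo.udsp_BATCH_EXEC_VALIDATION_SCRIPT 'TESTDB', 'TESTTABLE';
END TRY
BEGIN CATCH
    -- Log and re-throw so the SQL Agent job / ETL step is marked as failed
    PRINT ERROR_MESSAGE();
    THROW;
END CATCH
```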
By using these three objects you can automate the execution of any number of test cases; this helps you automate the data validation process, auto-fix data issues, and report failures to the concerned team.
TIP 1: UT_SQL and FX_SQL are entirely your responsibility; you can make the validation SQL as complex as you need and write a matching fix SQL. In my case the fix SQL didn't fit in the table column, so I created reusable procedures for the data correction instead.
TIP 2: When there were many error records, the error file grew large and I could not send it as an attachment due to restrictions in the email profile, so I suggest using limits such as TOP 100 while creating UT_SQL.
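Following TIP 2, the duplicate check from earlier could be capped like this (a sketch; pick whatever limit your mail profile's attachment size allows):

```sql
SELECT TOP (100)            -- cap the rows so the .csv attachment stays small
       testcol1,
       testcol2,
       COUNT(*) AS dup_count
FROM   testdb.dbo.testtable WITH (NOLOCK)
GROUP  BY testcol1, testcol2
HAVING COUNT(*) > 1
```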