Implementing SDC 2 with MD5/CRC32 function(CHECKSUM)

In general when requirement come to implement SCD type 2, we look out for any date field in the Target. If not date field we go for SCD type 2 with flag column, still we can go for SCD type two with versioning. This all scenario holds good when there is a date column or flag column in the table it’s easy for a developer to implement SCD type2.
Let’s say we come across a situation where there are no date columns or flag column, then how to implement SCD type 2. Usually we come across such situation when you are dealing with very old legacy system which doesn’t have such fields in the table. Then the solution is to generate hash code for the entire row and use it as identifier column for reference. To achieve this informatica allow us to use hash function namely MD5 () – Message Digest and CRC32 () – Cyclic Redundancy Check. The hash code in ETL perspective is referred as checksum value
Now we will try this in a simple example to see how does it works. I am considering a simple flat file and a target table for this scenario.

The above is the mapping for the same. So to begin with in expression transformation create two new output ports like below 


And edit the expression of the two new ports as below
MD5_CHECKSUM à MD5 (COL1||COL2|COL3||…..||COLn)
CRC32_CHECKSUM à CRC32(COL1||COL2|COL3||…..||COLn)
So how does it works, we are passing a concat value of all the fields to these hash function. It will create a unique hash value for these columns and we can use it for further unique key reference.
NOTE: It’s good to handle NULL values in the column which are used by NVL2() function to avoid any data discrepancies.
Next in look up transformation look up on the primary key for any changes and pass primary key and MD5_CHECKSUM and CRC32_CHECKSUM to a router transformation and populate rest fields from the source.






In Router create groups for insert n update rows as below
INSERT à ISNULL(LKP_PK)
UPDATE à NOT ISNULL(LKP_PK) AND  ( LKP_MD5_CHECKSUM != MD5_CHECKSUM AND LKP_CRC_CHECKSUM != CRC32_CHECKSUM )

Here you can use either of MD5 or CRC32 function, I have used both just for illustration. So once you have driven the records for INSERT and UPDATE rows perform SCD type 2 then.

NOTE: Tack care of Nulls and spaces in the column your are performing concat operation and passing it to MD5() function, Nulls create a big problem here as the value generated by MD5 won't be unique.

MD5(IIF(ISNULL(TRIM(COL1)),'A',TRIM(COL1))||IIF(ISNULL(TRIM(COL2)),'A',TRIM(COL2))||IIF(ISNULL(TRIM(COL3)),'A',TRIM(COL3)).....)
This way you can implement SCD type 2 without bothering about DATE or FLAG columns in the table.
Hope this helped you !!!!
Suggestions and corrections are always welcomed



Comments

  1. Hi,

    How will you create MD5 value for target data?
    i have understood that you have concatenated the source fields and used a MD5 function to generate a unique value MD5_CHECKSUM.
    But, how are we going to generate LKP_MD5_CHECKSUM in lookup transformation for target fields?

    ReplyDelete
  2. Hi Rahul,

    We can create MD5 for target in the look up t/f based on the look up condition.

    ReplyDelete
  3. We didn't not understand what is the lookup condition and where this LKP_MD5() handles??

    ReplyDelete
    Replies
    1. Lookup is to do the comparison between source MD5 hash value and target MD5 hash value, this is where we see if the incoming row is present in the target table, if present there is any change in that row data by comparing the hash value. Hash value will differ if there is any change in source data wrt target and further used to perform CDC at target.

      Delete
  4. in case of new record it performs insert. in case of update it updates the record. then how is it maintain the history as per SCD 2 definition??

    ReplyDelete
  5. MD5 is to identify if the input row is a new record or updated record. Using the the primary columns we can very well handle the history for in the table. Say if the input row is new then the primary column will also be entering the table for the first time so new insert. When the primary key is already present check for the MD5 value if the value is same with the existing record then there is no change in the source, if the MD5 value is not matching then flag the row for type 2 operation and save the history as needed.

    ReplyDelete

Post a Comment

Popular posts from this blog

Microsoft Power Automate -- Timeout Limitation of 120 Sec in SQL Actions -- Solution

How to parse XML response from REST API in Azure Data Factory

Power BI Usage Metrics with the help of Power Automate - PBI Analytics