Data Cleansing: Handle Non Printable and Special Characters – Informatica

Data cleansing is important aspect while performing ETL. There are many cases were we get non printable and special characters in the input file or table. Informatica Regular expression and function are very handy here.

So here is how we can avoid Non Printable and Special Characters when we are loading the target tables. Use REGREPLACE function for handling Non Printable characters and REPLACESTR for replacing multiple special characters in the field.

Examples:

Non Printable:

Syntax :

1.  REG_REPLACE( subject, pattern, replace, numReplacements )

Argument
Required/Optional
Description
subject
Required String datatype.
Passes the string you want to search.
pattern
Required String datatype.
Passes the character string to be replaced. You must use perl compatible regular expression syntax. Enclose the pattern in single quotes.
replace
Required String datatype.
Passes the new character string. numReplacements Optional Numeric datatype. Specifies the number of occurrences you want to replace. If you omit this option, REG_REPLACE will replace all occurrences of the character string.

Special Characters:
There are some cases where we are asked for replacing special characters from the field we can think of REPLACECHR() function here, but it will replace only one specific character from the field. Whereas we need to replace a set of special characters from the input field.
So here we go for REPLACESTR() function which can handle multiple special characters from the input field.
Syntax:  

Argument
Required/
Optional
Description
CaseFlag
Required
Must be an integer. Determines whether the arguments in this function are case sensitive. You can enter any valid transformation expression.
When CaseFlag is a number other than 0, the function is case sensitive.
When CaseFlag is a null value or 0, the function is not case sensitive.
InputString
Required
Must be a character string. Passes the strings you want to search. You can enter any valid transformation expression. If you pass a numeric value, the function converts it to a character string.
If InputString is NULL, REPLACESTR returns NULL.
OldString
Required
Must be a character string. The string you want to replace. You must enter at least one OldString argument. You can enter one or more characters per OldString argument. You can enter any valid transformation expression. You can also enter a text literal enclosed within single quotation marks, for example, 'abc'.
If you pass a numeric value, the function converts it to a character string.
When REPLACESTR contains multiple OldString arguments, and one or more OldString arguments is NULL or empty, REPLACESTR ignores the OldString argument. When all OldString arguments are NULL or empty, REPLACESTR returns InputString.
The function replaces the characters in the OldString arguments in the order they appear in the function. For example, if you enter multiple OldString arguments, the first OldString argument has precedence over the second OldString argument, and the second OldString argument has precedence over the third OldString argument. When REPLACESTR replaces a string, it places the cursor after the replaced characters in InputString before searching for the next match.
NewString
Required
Must be a character string. You can enter one character, multiple characters, an empty string, or NULL. You can enter any valid transformation expression.
If NewString is NULL or empty, REPLACESTR removes all occurrences of OldString in InputString.

Here REPLACESTR is very useful in replacing multiple characters from the input field. For instance take the expression REPLACESTR(1,PRODUCT_DESC,'"','.','?','#','+','/','!','^','~','`','$','') 


Here it will replace all occurrence of special characters (".#?#+/$!~`) will be replaced with '' ie NULL


Like if PRODUCT_DESC is 'ABC~`DEF^%GH$%XYZ#!' the output of the expression will be 'ABCDEFGHXYZ'. This is how we can handle special characters from the input field.
By using REG_REPLACE and REPLACESTR we can take care of both non printable and special characters from the input field like below


REG_REPLACE(REPLACESTR(1,PRODUCT_DESC,'"','.','?','#','+','/','!','^','~','`','$',''),'[^[:print:]]','')


Happy learning!!!!
Correction and suggestions are always welcomed :)


Use REG_REPLACE(PRODUCT_DESC.'[^[:print]]',''), Over here the function looks for '[^[:print]]' which means its searching for non printable characters in the field which we are passing and its been replaced with '' (NULL).

Sometimes if we use REG_REPLACE(PRODUCT_DESC,'[^[:print]]',NULL), Informatica doesn't replace non printable with NULLS, so its better we use '' instead of NULL

If we have a characters like ¿ Ã Â Ó Ø in the field it will replace those with nulls, for instance if the PRODUCT_DESC field looks like this ABCD-XYZ¿® C-PQRSTî it will cleansed and the output will be like ABCD-XYZ-CPQRST
REPLACESTR ( CaseFlag, InputString, OldString1, [OldString2, ... OldStringN,] NewString )
Important Note: Use relational connection with code page UTF-8 here Hope this will help us all in data cleansing.

Comments

  1. very useful :) Thanks for sharing.

    ReplyDelete
  2. i found it working great !
    thank you

    ReplyDelete
  3. Hey Niranjan,

    This blog is awesome.

    I have one query, so REG_REPLACE can not handle special character, is it right?

    ReplyDelete
    Replies
    1. Sorry for delayed reply, hope you have got the answers if not REG_REPLACE will handle special non printable characters depends on what arguments you pass to the function.

      Delete
  4. Saturam provides a platform to integrate your entire data infrastructure into one secure stronghold in order to provide you with greater control over your organization's data while increasing the ease of operating all aspects of your business. Analytics pipelines on your data lake will improve the efficiency of your entire organization while improving your control over your enterprise's data and the valuable, confidential data of your customers.

    ReplyDelete
  5. Happy to know this information was helpful to you.

    ReplyDelete
  6. Super site! I am Loving it!! Will return once more, Im taking your food likewise, Thanks.data science course in malaysia

    ReplyDelete
  7. Thanks for the nice blog. It was very useful for me. I'm happy I found this blog. Thank you for sharing with us,I too always learn something new from your post.data science course

    ReplyDelete
  8. Set aside my effort to peruse all the remarks, however I truly delighted in the article. It's consistently pleasant when you can not exclusively be educated, yet in addition, engaged!
    data science course in delhi

    ReplyDelete
  9. I looked at some very important and to maintain the length of the strength you are looking for on your website
    masters in artificial intelligence

    ReplyDelete
  10. Thank you so much for throwing light on such an important topic, not sure if you are interested in 3rd party product but ZappySys has very easy solution.
    Link here"

    Informatica Read Json

    ReplyDelete
    Replies
    1. Yes, I have came across ZappySys lately when I was working with SSIS to read data from API. Amazing solution

      Delete
  11. This comment has been removed by a blog administrator.

    ReplyDelete

Post a Comment

Popular posts from this blog

Microsoft Power Automate -- Timeout Limitation of 120 Sec in SQL Actions -- Solution

How to parse XML response from REST API in Azure Data Factory

Power BI Usage Metrics with the help of Power Automate - PBI Analytics