Data Cleansing: Handle Non Printable and Special Characters – Informatica
Data cleansing is important aspect while performing ETL. There are many cases were we get non printable and special characters in the input file or table. Informatica Regular expression and function are very handy here.
So here is how we can avoid Non Printable and Special Characters when we are loading the target tables. Use REGREPLACE function for handling Non Printable characters and REPLACESTR for replacing multiple special characters in the field.
Examples:
Non Printable:
Syntax :
1. REG_REPLACE( subject, pattern, replace, numReplacements )
Argument
|
Required/Optional
|
Description
|
subject
|
Required String datatype.
|
Passes the string you want to search.
|
pattern
|
Required String datatype.
|
Passes the character string to be replaced. You must use perl compatible regular expression syntax. Enclose the pattern in single quotes.
|
replace
|
Required String datatype.
|
Passes the new character string. numReplacements Optional Numeric datatype. Specifies the number of occurrences you want to replace. If you omit this option, REG_REPLACE will replace all occurrences of the character string.
|
Special Characters:
There are some cases where we are asked for replacing special characters from the field we can think of REPLACECHR() function here, but it will replace only one specific character from the field. Whereas we need to replace a set of special characters from the input field.
So here we go for REPLACESTR() function which can handle multiple special characters from the input field.
Syntax:
Argument
|
Required/
Optional
|
Description
|
CaseFlag
|
Required
|
Must be an integer. Determines whether the arguments in this function are case sensitive. You can enter any valid transformation expression.
When CaseFlag is a number other than 0, the function is case sensitive.
When CaseFlag is a null value or 0, the function is not case sensitive.
|
InputString
|
Required
|
Must be a character string. Passes the strings you want to search. You can enter any valid transformation expression. If you pass a numeric value, the function converts it to a character string.
If InputString is NULL, REPLACESTR returns NULL.
|
OldString
|
Required
|
Must be a character string. The string you want to replace. You must enter at least one OldString argument. You can enter one or more characters per OldString argument. You can enter any valid transformation expression. You can also enter a text literal enclosed within single quotation marks, for example, 'abc'.
If you pass a numeric value, the function converts it to a character string.
When REPLACESTR contains multiple OldString arguments, and one or more OldString arguments is NULL or empty, REPLACESTR ignores the OldString argument. When all OldString arguments are NULL or empty, REPLACESTR returns InputString.
The function replaces the characters in the OldString arguments in the order they appear in the function. For example, if you enter multiple OldString arguments, the first OldString argument has precedence over the second OldString argument, and the second OldString argument has precedence over the third OldString argument. When REPLACESTR replaces a string, it places the cursor after the replaced characters in InputString before searching for the next match.
|
NewString
|
Required
|
Must be a character string. You can enter one character, multiple characters, an empty string, or NULL. You can enter any valid transformation expression.
If NewString is NULL or empty, REPLACESTR removes all occurrences of OldString in InputString.
|
Here REPLACESTR is very useful in replacing multiple characters from the input field. For instance take the expression REPLACESTR(1,PRODUCT_DESC,'"','.','?','#','+','/','!','^','~','`','$','')
Here it will replace all occurrence of special characters (".#?#+/$!~`) will be replaced with '' ie NULL
Like if PRODUCT_DESC is 'ABC~`DEF^%GH$%XYZ#!' the output of the expression will be 'ABCDEFGHXYZ'. This is how we can handle special characters from the input field.
By using REG_REPLACE and REPLACESTR we can take care of both non printable and special characters from the input field like below
REG_REPLACE(REPLACESTR(1,PRODUCT_DESC,'"','.','?','#','+','/','!','^','~','`','$',''),'[^[:print:]]','')
Happy learning!!!!
Correction and suggestions are always welcomed :)
Use REG_REPLACE(PRODUCT_DESC.'[^[:print]]',''), Over here the function looks for '[^[:print]]' which means its searching for non printable characters in the field which we are passing and its been replaced with '' (NULL).
Sometimes if we use REG_REPLACE(PRODUCT_DESC,'[^[:print]]',NULL), Informatica doesn't replace non printable with NULLS, so its better we use '' instead of NULL
If we have a characters like ¿ à  � Ó Ø in the field it will replace those with nulls, for instance if the PRODUCT_DESC field looks like this ABCD-�XYZ¿® C-PQRSTî it will cleansed and the output will be like ABCD-XYZ-CPQRST
REPLACESTR ( CaseFlag, InputString, OldString1, [OldString2, ... OldStringN,] NewString )
Important Note: Use relational connection with code page UTF-8 here
Hope this will help us all in data cleansing.
Very useful :)
ReplyDeletewww.quickdatacleansing.com try this site
Deletevery useful :) Thanks for sharing.
ReplyDeletei found it working great !
ReplyDeletethank you
Glad it worked for you :)
DeleteHey Niranjan,
ReplyDeleteThis blog is awesome.
I have one query, so REG_REPLACE can not handle special character, is it right?
Sorry for delayed reply, hope you have got the answers if not REG_REPLACE will handle special non printable characters depends on what arguments you pass to the function.
DeleteSaturam provides a platform to integrate your entire data infrastructure into one secure stronghold in order to provide you with greater control over your organization's data while increasing the ease of operating all aspects of your business. Analytics pipelines on your data lake will improve the efficiency of your entire organization while improving your control over your enterprise's data and the valuable, confidential data of your customers.
ReplyDeleteHappy to know this information was helpful to you.
ReplyDelete👍
ReplyDeleteSuper site! I am Loving it!! Will return once more, Im taking your food likewise, Thanks.data science course in malaysia
ReplyDeleteThanks for the nice blog. It was very useful for me. I'm happy I found this blog. Thank you for sharing with us,I too always learn something new from your post.data science course
ReplyDeleteThank you for the information.
ReplyDeleteAngular JS online training
Angular JS training
App V online training
App V training
Application packaging online training
Application packaging training
Blockchain online training
Blockchain training
C online training
C training
Data power online training
Data power training
Data Stage online training
Data Stage training
Dynamic CRM online training
Dynamic CRM training
Ethical hacking online training
Ethical hacking training
Set aside my effort to peruse all the remarks, however I truly delighted in the article. It's consistently pleasant when you can not exclusively be educated, yet in addition, engaged!
ReplyDeletedata science course in delhi
I looked at some very important and to maintain the length of the strength you are looking for on your website
ReplyDeletemasters in artificial intelligence
👍🏻
ReplyDeletenice post.SAP Bods training
ReplyDeleteSAP QM training
oracle soa training
oracle dba training
aws training
Thank you so much for throwing light on such an important topic, not sure if you are interested in 3rd party product but ZappySys has very easy solution.
ReplyDeleteLink here"
Informatica Read Json
Yes, I have came across ZappySys lately when I was working with SSIS to read data from API. Amazing solution
DeleteThis comment has been removed by a blog administrator.
ReplyDelete