[SQL Tips] Clean Date Format inconsistency using Regex in Presto SQL

Background

Have you ever come across date column in the table that has a different format in the same column? some row has DD/MM/YYYY. Some has YYYY-MM-DD then when you want to cast a date function, it causes an error due to its inconsistent within that particular date column. Surely, this becomes a common classic problem. It may be due to that date column data is filled by human so the format is inconsistent each time they fill the value.

How to fix this? + Code

This depends on each SQL environment, but here Presto SQL which many companies that I join use, has a REGEX_EXP(capture certain Regular Expression pattern) and SPLIT_PART (extract data inside with a specified separator) to use, thus I use these functions to help check each row and convert them to another congruent format

Approach Here I detect the possible formats, which are D/M/Y or M/D/Y in the table using CASE WHEN + Regex Expression. After that, I convert all of them to YYYY-MM-DD the most coolest format in my opinion lol

See the example code below. Note that this example doesn’t cover all the date format cases, but it works for my small table. If your data has more format then you should include additional CASE WHEN conditions to handle them.

Plus, if your data has other separators e.g. from / to – then you should add that in your conditions as well.

--1) if middle part is 1-12 and first part >12. Input:DD/MM/YYYY (\d{1,2}\/\d{1,2}\/\d{4})	 
-- \d{1,2} to check if this date has 1 to 2 digits	 
--\d{4} to check if this date has exact 4 digits	 
--SPLIT_PART to split the data inside using certain separator (in this case is /)	

case 	 
when (regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true	 
and cast(split_part(date_field,'/',2) as int) between 1 and 12	 
and cast(split_part(date_field,'/',1) as int) >12	 
)	 

then	 
cast(date_format(date_parse(date_field,'%d/%m/%Y'),'%Y-%m-%d') as date)	 

--2) if middle part is >12 and first part is 1-12. Input:MM/DD/YYYY	 
when (regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true	 
and cast(split_part(date_field,'/',1) as int) between 1 and 12	 
and cast(split_part(date_field,'/',2) as int) >12	 

) --if middle part is 1-12 and first part >12	 

then	 
cast(date_format(date_parse(date_field,'%m/%d/%Y'),'%Y-%m-%d') as date)	 

--3) other cases DD/MM/YYYY e.g. 9/9/2022	 

when regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true	 
then	 

cast(date_format(date_parse(date_field,'%d/%m/%Y'),'%Y-%m-%d') as date)	 

--4) case YYYY-MM-DD	 
when regexp_like(date_field,'\d{4}-\d{1,2}-\d{1,2}') = true then	 
cast(date_field as date)	 

end as date_reformat

Final Output

date_fielddate_reformat
9/9/20222022-09-09
14/9/20222022-09-14
1/15/20222022-01-15

Limitation

Even though it looks work on paper, but in practice, it is better to validate the date format first before loading the data to the database. Nonetheless, it is possible that there is a legacy from the past that allows these unclean data to get in, so we have to fix them afterwards. 😭😭

There is an ambiguous value e.g. 09/12/2022 that we may not know which pattern it should have been between MM/DD/YYYY or DD/MM/YYYY. If so, we have to make an agreement with a person you work with that if there is an ambiguous data format coming in, we will force them to a default date pattern either YYYY-MM-DD or YYYY-DD-MM.

Hope it helps


Related blog posts

Most of my blogs are in Thai but there is an Google Translate option inside the blog:

📊 Data Analytics — Tech, Programming Tutorial blog

🏀 Sport | 📺 Anime Review |🎧 Music | 🎬 Film Series | 📚 Book Review

Loading

GleeGM

My journal on personal life and interests including Data Analytics 📈, Books 📚, Music 🎶, Basketball 🏀, Figure Skating ⛸, Anime, Film 📺, Tarot, Lenormand, Uranian Astrology🔮

You may also like...