[SQL Tips] Clean Date Format inconsistency using Regex in Presto SQL
date format inconsistent sql
สารบัญ
Background
Have you ever come across date column in the table that has a different format in the same column? some row has DD/MM/YYYY. Some has YYYY-MM-DD then when you want to cast a date function, it causes an error due to its inconsistent within that particular date column. Surely, this becomes a common classic problem. It may be due to that date column data is filled by human so the format is inconsistent each time they fill the value.
How to fix this? + Code
This depends on each SQL environment, but here Presto SQL which many companies that I join use, has a REGEX_EXP(capture certain Regular Expression pattern) and SPLIT_PART (extract data inside with a specified separator) to use, thus I use these functions to help check each row and convert them to another congruent format
Approach Here I detect the possible formats, which are D/M/Y or M/D/Y in the table using CASE WHEN + Regex Expression. After that, I convert all of them to YYYY-MM-DD the most coolest format in my opinion lol
See the example code below. Note that this example doesn’t cover all the date format cases, but it works for my small table. If your data has more format then you should include additional CASE WHEN conditions to handle them.
Plus, if your data has other separators e.g. from / to – then you should add that in your conditions as well.
--1) if middle part is 1-12 and first part >12. Input:DD/MM/YYYY (\d{1,2}\/\d{1,2}\/\d{4})
-- \d{1,2} to check if this date has 1 to 2 digits
--\d{4} to check if this date has exact 4 digits
--SPLIT_PART to split the data inside using certain separator (in this case is /)
case
when (regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true
and cast(split_part(date_field,'/',2) as int) between 1 and 12
and cast(split_part(date_field,'/',1) as int) >12
)
then
cast(date_format(date_parse(date_field,'%d/%m/%Y'),'%Y-%m-%d') as date)
--2) if middle part is >12 and first part is 1-12. Input:MM/DD/YYYY
when (regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true
and cast(split_part(date_field,'/',1) as int) between 1 and 12
and cast(split_part(date_field,'/',2) as int) >12
) --if middle part is 1-12 and first part >12
then
cast(date_format(date_parse(date_field,'%m/%d/%Y'),'%Y-%m-%d') as date)
--3) other cases DD/MM/YYYY e.g. 9/9/2022
when regexp_like(date_field,'\d{1,2}\/\d{1,2}\/\d{4}') = true
then
cast(date_format(date_parse(date_field,'%d/%m/%Y'),'%Y-%m-%d') as date)
--4) case YYYY-MM-DD
when regexp_like(date_field,'\d{4}-\d{1,2}-\d{1,2}') = true then
cast(date_field as date)
end as date_reformat
Final Output
date_field | date_reformat |
9/9/2022 | 2022-09-09 |
14/9/2022 | 2022-09-14 |
1/15/2022 | 2022-01-15 |
Limitation
Even though it looks work on paper, but in practice, it is better to validate the date format first before loading the data to the database. Nonetheless, it is possible that there is a legacy from the past that allows these unclean data to get in, so we have to fix them afterwards. 😭😭
There is an ambiguous value e.g. 09/12/2022 that we may not know which pattern it should have been between MM/DD/YYYY or DD/MM/YYYY. If so, we have to make an agreement with a person you work with that if there is an ambiguous data format coming in, we will force them to a default date pattern either YYYY-MM-DD or YYYY-DD-MM.
Hope it helps
Related blog posts
Most of my blogs are in Thai but there is an Google Translate option inside the blog:
📊 Data Analytics — Tech, Programming Tutorial blog
🏀 Sport | 📺 Anime Review |🎧 Music | 🎬 Film Series | 📚 Book Review
date format inconsistent sql