add_months behavior in Hive and Spark vs Oracle
13 Aug 2018
The add_months logic seems to work differently in Oracle and the Big data ecosystem tools (Hive and Spark). It is probably because oracle’s date includes time as well, so it might be more equivalent to timestamp in hive/spark.
Oracle has the same behavior whether you use 1 month interval or add_months function.
select
current_date,
add_months(current_date,1) add_months,
current_date + interval '1' month interval_1_month
from dual;
+--------------------+--------------------+------------------------------+
| CURRENT_DATE| ADD_MONTHS|CURRENT_DATE+INTERVAL'1'MONTH |
+--------------------+--------------------+------------------------------+
+5/4/2018 12:07:13 AM|6/4/2018 12:07:13 AM|6/4/2018 12:07:13 AM |
+--------------------+--------------------+------------------------------+
In case of hive and spark, interval works well with timestamps as expected, but add_months is expected to work on dates. The timestamp, therefore is implicitly converted to date and a month is added. If you need to preserve the timestamps, the INTERVAL function is the only simple option.
Hive.
hive> select x, add_months(x,-1), x + interval 1 month x2 from (select cast('2018-01-01 11:25:35' as timestamp) x) src;
OK
2018-01-01 11:25:35 2017-12-01 2018-02-01 11:25:35
Time taken: 0.075 seconds, Fetched: 1 row(s)
Spark.
scala> spark.sql("select x, add_months(x,-1) x1, x + interval 1 month x2 from df").show()
+-------------------+----------+-------------------+
| x| x1| x2|
+-------------------+----------+-------------------+
|2018-01-01 11:25:36|2017-12-01|2018-02-01 11:25:36|
+-------------------+----------+-------------------+