Importing data incrementally from RDBMS to Hive/Hadoop using Sqoop
I have an Oracle database and need to import its data into a Hive table. The daily import size is around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can updated values be handled?
For example, say I imported today's data as a partition, and the next day some fields are updated with new values.
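Using Sqoop's lastmodified mode (--incremental lastmodified) I can get those values, but do I need to send the updated values to a new partition or to the old (already existing) partition?
If I send them to a new partition, the data is duplicated. If I want to send them to the existing partition, how can that be achieved?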
Your option is to overwrite the entire existing partition with 'insert overwrite table...'.
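As a rough illustration of that overwrite, here is a minimal Python sketch that only builds the HiveQL text for one partition. The table names (`orders`, `orders_staging`), the partition column `dt`, and the column list are hypothetical, and it assumes the refreshed rows have already been landed in a staging table (for example by a Sqoop import).

```python
def overwrite_partition_sql(target, staging, columns, part_col, part_value):
    """Build a HiveQL statement that rewrites one partition of `target`
    from the rows in `staging` that belong to that partition.

    All names are placeholders; the partition column is left out of the
    SELECT list because its value is fixed in the PARTITION clause.
    """
    cols = ", ".join(columns)
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({part_col}='{part_value}') "
        f"SELECT {cols} FROM {staging} "
        f"WHERE {part_col}='{part_value}'"
    )


if __name__ == "__main__":
    # Hypothetical example: refresh the 2024-03-14 partition.
    print(overwrite_partition_sql(
        "orders", "orders_staging", ["order_id", "status"], "dt", "2024-03-14"
    ))
```

The generated statement could then be passed to whatever Hive client you already use; the sketch itself does not connect to anything.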
The question is: how far back are you going to keep updating the data?
I can think of 3 approaches you can consider:
- Decide on a threshold for 'fresh' data, for example '14 days backwards' or '1 month backwards'.
Each day you run the job, you overwrite partitions (only the ones that have updated values) going backwards, until you reach the threshold you decided on.
With ~1 GB a day this should be feasible.
Data from before the decided time is not guaranteed to be 100% correct.
This scenario is relevant if you know the fields can only be changed within a certain time window after they are first set.
- Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
Split your daily job into 2 tasks: the new data being written for the run day, and the updated data that you need to run backwards. Sqoop will be responsible for the new data. Take care of the updated data 'manually' (some script that generates the update statements).
- Don't use partitions based on time. Maybe dynamic partitioning is more suitable for your use case. It depends on the nature of the data being handled.
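To make the first two approaches a bit more concrete, here is a minimal Python sketch under some assumptions: partitions are daily, the list of days that received updates is already known (for example from a last-modified column in Oracle), and the table and column names are hypothetical. `partitions_to_refresh` picks the partitions inside the freshness threshold, and `update_statement` shows the kind of text a 'manual' script might generate for an ACID table. The quoting is deliberately naive and is not meant for untrusted input.

```python
from datetime import date, timedelta


def partitions_to_refresh(run_day, updated_days, threshold_days=14):
    """Return the daily partitions that have updates and still fall
    inside the freshness window [run_day - threshold_days, run_day]."""
    cutoff = run_day - timedelta(days=threshold_days)
    return sorted(d for d in set(updated_days) if cutoff <= d <= run_day)


def update_statement(table, key_col, key_value, changes):
    """Build an UPDATE statement for a transactional (ACID) Hive table.

    `changes` maps column name -> new value. Names are placeholders.
    """
    def lit(value):
        # Numbers are emitted as-is; everything else as a quoted string.
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "\\'") + "'"

    assignments = ", ".join(f"{col} = {lit(val)}" for col, val in changes.items())
    return f"UPDATE {table} SET {assignments} WHERE {key_col} = {lit(key_value)}"


if __name__ == "__main__":
    run_day = date(2024, 3, 15)
    touched = [date(2024, 3, 14), date(2024, 3, 1), date(2024, 2, 1)]
    for day in partitions_to_refresh(run_day, touched):
        print("overwrite partition", day.isoformat())
    print(update_statement("orders", "order_id", 42, {"status": "shipped"}))
```

With a 14-day threshold, the 2024-02-01 partition in the example is skipped, which is exactly the trade-off described in the first approach: changes older than the window are not picked up.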