Preprocessing External Tables
How to use inline preprocessing on external tables to streamline data transformations
by Shuang Ding
External tables are a feature that enables users to access data in table format from a location external to the database. Widely used in data analytics environments, they eliminate the step of moving data from the source into a landing/staging area, reduce processing time and storage cost, and enable databases to connect to Hadoop HDFS or Apache Hive tables.
There are times when the incoming data files are not in the desired format: for example, the received files are compressed and need to be unzipped, or the files need to be altered to fit a specific layout.
It is possible to create an additional task, such as a Python or shell script, to process the data files, but you then have to make sure the script always executes before any read from the external table in order to get real-time results. Rather than creating more ad hoc tasks to address these needs, we can use the inline processing capability, PREPROCESSOR, to do the data transformation on the fly without adding more dependencies to the existing process.
Consider that we receive a compressed file, storedata.gz, on a daily basis and store it in Hadoop HDFS. We need to first unzip it to storedata.csv, then adjust the raw data into a standard format before reading from an external table.
Here is how the data might look in storedata.csv:
Location | Item | Month | Prices | Units
---|---|---|---|---
STORE01/09S | A | 7/1/2020 | 10 | 2
STORE01S | B | 7/1/2020 | 15 | 1
STORE01S | A | 6/1/2020 | 30 | 2
STORE09S | A | 6/1/2020 | 20 | 1
STORE01S | B | 5/1/2020 | 10 | 1
STORE02S | C | 5/1/2020 | 5 | 1
Notice that the first row has two store numbers, 'STORE01/09S', in one field. We want to split it into two records so we can easily aggregate on the location column and perform analysis.
Location | Item | Month | Prices | Units
---|---|---|---|---
STORE01S | A | 7/1/2020 | 10 | 2
STORE09S | A | 7/1/2020 | 10 | 2
Ideally, we want to get the desired output directly from the compressed data file received daily without having to run any other process each time there is a read from the external table.
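Before looking at the full preprocessor script, the split rule itself can be sketched in isolation. The snippet below is a standalone illustration using awk on one sample record (the data and the awk one-liner are mine, not part of the production script): the left record keeps the first store number plus the trailing "S", and the right record replaces the last two characters of the first number with the second fragment.

```shell
# Sample record with a combined store number, as in storedata.csv above
line='STORE01/09S,A,7/1/2020,10,2'

# Expand any row whose first field contains "/" into two rows;
# pass all other rows through unchanged.
expanded=$(printf '%s\n' "$line" | awk -F',' '
  index($1, "/") {
    split($1, p, "/")                                # p[1]="STORE01", p[2]="09S"
    left  = p[1] "S"                                 # "STORE01S"
    right = substr(p[1], 1, length(p[1]) - 2) p[2]   # "STORE" + "09S" = "STORE09S"
    rest = ""
    for (i = 2; i <= NF; i++) rest = rest "," $i
    print left rest
    print right rest
    next
  }
  { print }')

printf '%s\n' "$expanded"
```

The same arithmetic (drop the last two characters of the left fragment, then append the right fragment) is what the ksh script below performs with parameter expansion.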
Let's begin by creating the necessary directories to be used by the external table. Note that this is a minimal setup for testing purposes. For security, it's recommended to keep executables and data files in separate directories, group all required preprocessor tools into a dedicated directory, and grant EXECUTE privileges only to the users who need them.
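For example, the separation of duties described above can be enforced with directory-level grants (the user name below is hypothetical; the exact privileges depend on your environment):

```sql
-- Hypothetical user ETL_USER can read data files from one directory and
-- execute, but not modify, the preprocessor scripts in the other.
GRANT READ ON DIRECTORY hdfs_dir TO etl_user;
GRANT READ, EXECUTE ON DIRECTORY exter_dir TO etl_user;
```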
CREATE OR REPLACE DIRECTORY hdfs_dir AS '/data/hdfs/data/';
CREATE OR REPLACE DIRECTORY exter_dir AS '/data/ora_exter_dir/';
We will then create a quick script, storedata.ksh, to decompress the data file and adjust the format.
#!/bin/ksh
# Decompress the incoming file ($1), split combined store numbers into
# two records, and write the adjusted rows to stdout.
tempfile=temp_storedata.csv
L=1
/bin/gunzip -c $1 | while read line
do
    if [ ${L} -eq 1 ]
    then
        # Pass the header row through unchanged
        echo $line > $tempfile
        let L=L+1
        continue
    fi
    array=(${line//\,/ })
    total=${#array[@]}
    lp=""
    rp=""
    Col=1
    for i in "${!array[@]}"; do
        each=${array[i]}
        if [ ${Col} -eq 1 ]
        then
            # Split a combined location such as STORE01/09S in the first column
            newarray=(${each//\// })
            newtotal=${#newarray[@]}
            if [ $newtotal -eq 2 ]
            then
                lpp=${newarray[0]}
                rpp=${newarray[1]}
                lp=${lpp}S
                rp=${lpp:0:${#lpp}-2}${rpp}
            else
                # No combined store number; copy the row as-is and move on
                echo $line >> $tempfile
                let Col=Col+1
                continue 2
            fi
        else
            lp=$lp","${array[i]}
            rp=$rp","${array[i]}
        fi
        let Col=Col+1
    done
    echo $lp >> $tempfile
    echo $rp >> $tempfile
    let L=L+1
done
/bin/cat $tempfile
/bin/rm -f $tempfile
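Note that the script receives the data file path as its first argument and delivers its final result on stdout via /bin/cat, which is the stream ORACLE_LOADER reads from a preprocessor. A quick self-contained check of that decompress-to-stdout contract (the file names here are made up for illustration):

```shell
# Create and compress a one-line sample, then decompress to stdout the
# same way the preprocessor handles its $1 argument.
printf 'Location,Item,Month,Prices,Units\n' > sample.csv
gzip -f sample.csv                       # produces sample.csv.gz
decompressed=$(gunzip -c sample.csv.gz)
printf '%s\n' "$decompressed"
rm -f sample.csv.gz
```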
The final step is to create the external table. You can see we used the PREPROCESSOR clause with the script storedata.ksh here.
CREATE TABLE TEST.STORE_DATA
(
  LOCATION VARCHAR2 (100 BYTE),
  ITEM     VARCHAR2 (100 BYTE),
  MONTH    VARCHAR2 (100 BYTE),
  PRICE    VARCHAR2 (100 BYTE),
  UNITS    VARCHAR2 (100 BYTE)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY exter_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR 'storedata.ksh'
    SKIP 1
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION (hdfs_dir:'storedata.gz')
)
REJECT LIMIT 0;
Let’s check the content of this table.
Location | Item | Month | Prices | Units
---|---|---|---|---
STORE01S | A | 7/1/2020 | 10 | 2
STORE09S | A | 7/1/2020 | 10 | 2
STORE01S | B | 7/1/2020 | 15 | 1
STORE01S | A | 6/1/2020 | 30 | 2
STORE09S | A | 6/1/2020 | 20 | 1
STORE01S | B | 5/1/2020 | 10 | 1
STORE02S | C | 5/1/2020 | 5 | 1
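With the combined store numbers split into separate rows, aggregating on the location column now works as intended. For instance (a hypothetical query against the table above; UNITS is stored as VARCHAR2 and is implicitly converted for the sum):

```sql
-- Total units sold per store, now that combined store numbers are split
SELECT location, SUM(units) AS total_units
FROM test.store_data
GROUP BY location
ORDER BY location;
```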
This is only one small demonstration of the power of PREPROCESSOR in external tables. You can apply it to almost any preprocessing your data requires, without introducing additional dependencies and while saving storage space.