Preprocessing External Tables

How to use inline preprocessing on external tables to streamline data transformations

External tables are a feature that lets users query data, in table format, from a location outside the database. Widely used in data analytics environments, they eliminate the step of moving data from the source into a landing/staging area, reduce processing time and storage cost, and allow the database to connect to sources such as Hadoop HDFS or Apache Hive tables.

There are times when the incoming data files are not in the desired format: for example, they may arrive compressed and need to be unzipped, or they may need to be altered to fit a specific layout.

It is possible to create an additional task, such as a Python or shell script, to process the data files, but you then have to make sure that the script always runs before any read from the external table in order to get an up-to-date result. Rather than creating more ad hoc tasks to address these needs, we can use the inline preprocessing capability, PREPROCESSOR, to transform the data on the fly without adding more dependencies to the existing process.

Suppose we receive a compressed file, storedata.gz, on a daily basis and store it in Hadoop HDFS. We need to first unzip it to storedata.csv, then adjust the raw data into a standard format before reading it from an external table.

Here is how the data might look in storedata.csv:

Location      Item  Month     Prices  Units
STORE01/09S   A     7/1/2020  10      2
STORE01S      B     7/1/2020  15      1
STORE01S      A     6/1/2020  30      2
STORE09S      A     6/1/2020  20      1
STORE01S      B     5/1/2020  10      1
STORE02S      C     5/1/2020  5       1

Notice that the first row has two store numbers, ‘STORE01/09S’, in one line. We want to split it into two records so that we can easily aggregate on the location column and perform analysis.

Location   Item  Month     Prices  Units
STORE01S   A     7/1/2020  10      2
STORE09S   A     7/1/2020  10      2

Ideally, we want to get the desired output directly from the compressed data file received daily, without having to run any separate process each time the external table is read.

Let’s begin by creating the necessary directories to be used by the external table: one for the data files and one for the preprocessor script. For security, it’s recommended to keep executables and data in different directories, group all required preprocessor tools into a dedicated directory, and grant EXECUTE privileges on it only to the users who need them.

CREATE OR REPLACE DIRECTORY hdfs_dir AS '/data/hdfs/data/';     -- incoming data files

CREATE OR REPLACE DIRECTORY exter_dir AS '/data/ora_exter_dir/'; -- preprocessor script
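
As a minimal sketch of the grants mentioned above, assuming a hypothetical account named ETL_USER that will query the external table:

-- ETL_USER is a placeholder; substitute the actual grantee.
GRANT READ ON DIRECTORY hdfs_dir TO etl_user;
GRANT READ, EXECUTE ON DIRECTORY exter_dir TO etl_user;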

We will then create a short script, storedata.ksh, to decompress the data file and adjust the format. Oracle invokes the preprocessor with the path of the location file as its first argument ($1) and reads the transformed records from the script's standard output.

#!/bin/ksh
# Preprocessor for the STORE_DATA external table.
# Oracle passes the path of the location file (storedata.gz) as $1
# and reads the transformed records from this script's stdout.
# Full paths are used for executables because the preprocessor
# runs with a minimal environment.

tempfile=/tmp/temp_storedata.$$   # unique temp file per invocation
L=1

/bin/gunzip -c "$1" | while read line
do
    # Pass the header row through unchanged.
    if [ ${L} -eq 1 ]
    then
        echo "$line" > "$tempfile"
        let L=L+1
        continue
    fi

    array=(${line//,/ })          # split the record on commas
    lp=""                         # first output record
    rp=""                         # second output record
    Col=1
    for i in "${!array[@]}"; do
        each=${array[i]}
        if [ ${Col} -eq 1 ]
        then
            # Check for a combined location such as STORE01/09S.
            newarray=(${each//\// })
            newtotal=${#newarray[@]}
            if [ ${newtotal} -eq 2 ]
            then
                lpp=${newarray[0]}              # e.g. STORE01
                rpp=${newarray[1]}              # e.g. 09S
                lp=${lpp}S                      # -> STORE01S
                rp=${lpp:0:${#lpp}-2}${rpp}     # -> STORE09S
            else
                # Normal record: emit it as-is, move to the next line.
                echo "$line" >> "$tempfile"
                continue 2
            fi
        else
            # Copy the remaining columns into both records.
            lp=${lp},${array[i]}
            rp=${rp},${array[i]}
        fi
        let Col=Col+1
    done

    echo "$lp" >> "$tempfile"
    echo "$rp" >> "$tempfile"
done

/bin/cat "$tempfile"
/bin/rm -f "$tempfile"
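
Before wiring the script into the external table, you can test it by hand the same way the database will invoke it, passing the path of the compressed file as the only argument (paths taken from the directory definitions above):

chmod +x /data/ora_exter_dir/storedata.ksh
/data/ora_exter_dir/storedata.ksh /data/hdfs/data/storedata.gz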

The final step is to create the external table; you can see the PREPROCESSOR clause referencing storedata.ksh below. Because no directory is given in the clause, the script must reside in the default directory, exter_dir, and be executable by the database OS user.

CREATE TABLE TEST.STORE_DATA
(
    LOCATION     VARCHAR2 (100 BYTE),
    ITEM         VARCHAR2 (100 BYTE),
    MONTH        VARCHAR2 (100 BYTE),
    PRICE        VARCHAR2 (100 BYTE),
    UNITS        VARCHAR2 (100 BYTE)
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY exter_dir
    ACCESS PARAMETERS
    (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR 'storedata.ksh'
        SKIP 1
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
    )
    LOCATION (hdfs_dir:'storedata.gz')
)
REJECT LIMIT 0;

Let’s check the content of this table with a simple query; the decompression and reformatting happen transparently on every read.
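
SELECT LOCATION, ITEM, MONTH, PRICE, UNITS
  FROM TEST.STORE_DATA;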

Location   Item  Month     Prices  Units
STORE01S   A     7/1/2020  10      2
STORE09S   A     7/1/2020  10      2
STORE01S   B     7/1/2020  15      1
STORE01S   A     6/1/2020  30      2
STORE09S   A     6/1/2020  20      1
STORE01S   B     5/1/2020  10      1
STORE02S   C     5/1/2020  5       1
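
With the combined store number split into separate records, the location-level aggregation that motivated this exercise works directly against the external table. A minimal sketch; TO_NUMBER is applied because the demo DDL declares every column as VARCHAR2:

SELECT LOCATION,
       SUM (TO_NUMBER (UNITS))                      AS TOTAL_UNITS,
       SUM (TO_NUMBER (PRICE) * TO_NUMBER (UNITS))  AS TOTAL_SALES
  FROM TEST.STORE_DATA
 GROUP BY LOCATION
 ORDER BY LOCATION;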

This is only one small demonstration of the power of PREPROCESSOR in external tables. You can apply it to almost any kind of preprocessing your data requires, without introducing additional dependencies and while saving storage space.

What will your next project be?
