If you are in charge of an Oracle database that accumulates an incredible amount of data, there will likely come a time when you are faced with implementing a data purging solution as part of a data retention plan. Without one, the database will incur massive but unnecessary growth. The volume of data, combined with its velocity (its rate of accumulation) and other factors, needs to be considered and will drive the ultimate solution. Unfortunately, there is no out-of-the-box, one-size-fits-all approach to data purging, and if you have ever dealt with big data, you will quickly realize that there is an art to crafting a solution that performs well in your system without introducing more headaches.

In this article, my use of the term “big data” is not so much about data complexity, but rather about data with high volume and high velocity; that is, an extreme rate at which data flows into a specific table, leading to a rapid accumulation of rows. As the subject of this article implies, we are working with data that only has value for a limited time. We are going to discuss an approach that I recently took within an Oracle database, and though, as stated, there is no one-size-fits-all approach, I am hoping that I can provide some information and ideas to help inspire the approach that you ultimately take.

Our Scenario

Let’s say that as part of our database we have a basic table used to store application event data.

CREATE TABLE EVENT (
  EVENT_DT TIMESTAMP (6) DEFAULT SYSTIMESTAMP,
  EVENT_TYPE INTEGER,
  DESCRIPTION VARCHAR2(250 BYTE)
); 

CREATE INDEX EVENT_DT_IX
  ON EVENT (EVENT_DT);

This table is being fed an incredible number of event rows from hundreds of machines as business elements are processed. The data really only has value for up to 30 days, as it is used mostly for troubleshooting and diagnosing issues in the business flow. Volume is becoming a big problem, and it has been determined that anything older than 30 days can be purged. Whether or not a relational database is the best place to store this data is not the immediate concern (a topic for another time, perhaps); we have to work within the confines of the existing system.

The Art of Deletion

To solve our problem, the first approach that often comes to mind is a simple DELETE such as:

DELETE FROM EVENT WHERE EVENT_DT < SYSDATE - 30;
COMMIT;

This statement would be executed via some scheduled task, service, application event, etc., at a configured interval. Makes sense, so what is so hard about that? Job's done, right? Not by a long shot. The simplistic approach seems reasonable, especially since EVENT_DT is indexed. If your dataset is small, and the database isn't very active or has extended periods of non-activity, this might work just fine for you; but then again, if that were the case, you probably wouldn't be here right now.

The more data you have in this table, coupled with a relentless velocity, the more likely the DELETE above is to cause all sorts of blocking, which could eventually bring your system to its knees.

When formulating a performant solution to this problem there are several things that need to be considered:

  • Limit the number of round trips from the application to the database.
  • Limit the number of context switches between PL/SQL and the database.
  • Any CRUD operations against the data need to be very fast (in our case, SELECTs and DELETEs). Just having the right indexing on the table isn't going to be enough when dealing with massive amounts of data.

This is really where the art of deletion comes into play, and where you need to get those creative wheels spinning. The following is a solution that I came up with, which we will break down:

CREATE OR REPLACE PROCEDURE PURGE_OLD_EVENTS(fromTime TIMESTAMP, toTime TIMESTAMP, maxExecutionMinutes PLS_INTEGER)
IS
  TYPE rowIDs_t IS TABLE OF ROWID;
  rowIDs             rowIDs_t;
  rowCount           PLS_INTEGER := 1;
  oldestDateTime     TIMESTAMP;
  minuteLater        TIMESTAMP;
  terminateExecution PLS_INTEGER := 0;
  startExecution     TIMESTAMP   := SYSTIMESTAMP;
  stopExecution      TIMESTAMP;
BEGIN

  -- The point in time at which processing should stop (only enforced when maxExecutionMinutes > 0).
  stopExecution  := startExecution + INTERVAL '60' SECOND * maxExecutionMinutes;
  oldestDateTime := fromTime;

  -- Walk the date range from oldest to newest, one-minute block at a time.
  WHILE oldestDateTime < toTime AND terminateExecution = 0 LOOP
    minuteLater := oldestDateTime + INTERVAL '60' SECOND;
    rowCount    := 1;

    IF minuteLater > toTime THEN
      minuteLater := toTime;
    END IF;

    -- Within the current one-minute block, delete and commit up to 500 rows at a time.
    WHILE rowCount > 0 AND terminateExecution = 0 LOOP
      SELECT rowid BULK COLLECT INTO rowIDs
        FROM EVENT
       WHERE EVENT_DT BETWEEN oldestDateTime AND minuteLater
       FETCH FIRST 500 ROWS ONLY;

      rowCount := rowIDs.count;

      -- Delete the collected rows in a single context switch.
      FORALL INX IN 1 .. rowIDs.count
        DELETE FROM EVENT WHERE rowid = rowIDs(INX);

      COMMIT;

      -- Stop cleanly once the allotted execution time has elapsed.
      IF maxExecutionMinutes > 0 THEN
        IF CAST(stopExecution AS DATE) < CAST(SYSTIMESTAMP AS DATE) THEN
          terminateExecution := 1;
        END IF;
      END IF;

    END LOOP;

    -- Jump to the oldest row remaining at or beyond the current block (NULL ends the outer loop).
    SELECT MIN(EVENT_DT) INTO oldestDateTime FROM EVENT WHERE EVENT_DT >= minuteLater;

    -- Take a breath so other sessions can get at the table.
    DBMS_SESSION.Sleep(0.05);

  END LOOP;

END;

Logical Flow of PURGE_OLD_EVENTS

I wrote the procedure to DELETE all rows within the provided date range in “small” chunks, from oldest to newest, in a manner that dramatically reduces, and likely eliminates, blocking. This is how it is accomplished:

  1. Loop through the given date range in one-minute time blocks.
  2. For each one-minute time block, we BULK COLLECT the ROWIDs of up to 500 rows at a time.
  3. When a block of up to 500 rows has been collected, we use a FORALL statement to DELETE those rows in a single context switch.
  4. The deletion is committed.
  5. The loop moves on to the next block of up to 500 rows and repeats steps 3 and 4, until no rows remain in the current time block, or until it is determined that enough time has transpired that it is time to stop processing.
  6. Sleep for a fraction of a second, to allow other operations waiting in the pipe to execute.
  7. Move to the next one-minute time block and repeat from step 2, until we ultimately hit the end of our date range, or until it is determined that enough time has transpired that it is time to stop processing.

Let’s take a deeper look at elements of the PURGE_OLD_EVENTS procedure to see what we are doing, and why.

The Parameters

In my procedure I accept three parameters, which allow the calling application (in my case, a Java Spring Boot service) to specify what the procedure purges and how long it will run.

fromTime and toTime – These indicate the full range of date/times of data that we want to purge. The calling process determines what this range is.

maxExecutionMinutes – This isn't strictly necessary (and adds some additional complexity to the procedure), but I generally like to put a mechanism like this into procedures where a complete execution might not be necessary, or where the total execution time is difficult or impossible to predict (and could be very long). It allows the calling application to cleanly control the maximum amount of time the procedure will execute, which makes it much easier to schedule the procedure for frequent or continuous run scenarios.
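
For example, a calling process could invoke the procedure from an anonymous block along these lines (the 30-day cutoff and the 9-minute cap are just illustrative values):

DECLARE
  purgeUpTo TIMESTAMP := SYSTIMESTAMP - INTERVAL '30' DAY;
  purgeFrom TIMESTAMP;
BEGIN
  -- Start at the oldest row still in the table (NULL if the table is empty).
  SELECT MIN(EVENT_DT) INTO purgeFrom FROM EVENT;

  IF purgeFrom IS NOT NULL AND purgeFrom < purgeUpTo THEN
    -- Purge the range, but stop after at most 9 minutes of work.
    PURGE_OLD_EVENTS(purgeFrom, purgeUpTo, 9);
  END IF;
END;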

The Process

BULK COLLECT and ROWID

Using BULK COLLECT we want to build a collection of identifiers for the rows that we want to delete. Primary keys are a good source, but even better is the value of the ROWID pseudocolumn. Every row in the table has a ROWID representing the address of that row, and fetching a row by its ROWID is the fastest way to access it, faster than fetching by a primary key value. Thus, when we BULK COLLECT the row identifiers, it will later be more performant to have the ROWIDs on hand when executing the DELETEs. The entire reason for using a collection is so that we can use a FORALL statement, which will be covered shortly.

The following demonstrates bulk collecting the relevant ROWIDs into a collection, which will be used later in a FORALL statement to delete the corresponding rows.

TYPE rowIDs_t IS TABLE OF rowid;
rowIDs rowIDs_t;

…

SELECT ROWID BULK COLLECT INTO rowIDs
  FROM EVENT
 WHERE EVENT_DT BETWEEN oldestDateTime AND minuteLater
 FETCH FIRST 500 ROWS ONLY;

Before we move on, note that the query used in our BULK COLLECT is critical to performance. When querying the table to obtain the data that we want to BULK COLLECT, it is best to keep the amount of data being collected relatively small, hence the FETCH FIRST clause to limit the size.

BETWEEN

In order to make sure that our SELECT query is fast, we want to keep the range focused on a small subset of data. This is where the BETWEEN predicate comes in, getting us to a specific window of data. BETWEEN itself is simply shorthand for an inclusive greater-than-or-equal/less-than-or-equal pair, so the performance win comes not from the keyword but from keeping the window narrow enough that the range scan on EVENT_DT touches only a small slice of the table.
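
For reference, the two queries below are logically equivalent against our EVENT table; BETWEEN is just the more compact way to express the inclusive range:

SELECT COUNT(*) FROM EVENT
 WHERE EVENT_DT BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-01 00:01:00';

SELECT COUNT(*) FROM EVENT
 WHERE EVENT_DT >= TIMESTAMP '2022-01-01 00:00:00'
   AND EVENT_DT <= TIMESTAMP '2022-01-01 00:01:00';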

This is where it really helps to know your data. Tweaking the date range based on your test runs will be required so that this SELECT is fast. Depending on the general data velocity, your range might be an hour, or 20 minutes, or 10 seconds, etc. You don't want to make the range so small that this BULK COLLECT process needs to run more often than necessary, nor do you want a range so large that it slows the query down. In my particular solution, I discovered through testing that fetching rows within single-minute increments of time was ultimately the most performant.
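
If you want quick numbers while experimenting with window sizes, a throwaway block like the following (a testing aid, not part of the solution) times a single BULK COLLECT. DBMS_UTILITY.GET_TIME returns a timer value in hundredths of a second, so the difference between two calls gives the elapsed time:

DECLARE
  TYPE rowIDs_t IS TABLE OF ROWID;
  rowIDs rowIDs_t;
  windowStart TIMESTAMP := TIMESTAMP '2022-01-01 00:00:00';  -- pick a window inside your test data
  t0 PLS_INTEGER;
BEGIN
  t0 := DBMS_UTILITY.GET_TIME;

  SELECT ROWID BULK COLLECT INTO rowIDs
    FROM EVENT
   WHERE EVENT_DT BETWEEN windowStart AND windowStart + INTERVAL '60' SECOND
   FETCH FIRST 500 ROWS ONLY;

  -- Run with SERVEROUTPUT enabled to see the timing.
  DBMS_OUTPUT.PUT_LINE(rowIDs.COUNT || ' rowids fetched in '
    || (DBMS_UTILITY.GET_TIME - t0) || ' hundredths of a second');
END;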

FORALL

The whole reason for BULK COLLECTing in the first place is so that we can utilize FORALL. When it comes to bulk INSERTs, DELETEs, and UPDATEs, it is far more efficient to execute these within a FORALL than within a standard FOR loop. The reason is that a standard FOR loop incurs a context switch between the PL/SQL engine and the SQL engine on every iteration. FORALL, on the other hand (which reads like a loop but is really a bulk-bind statement), sends all of the statements to the SQL engine in a single context switch, thus dramatically speeding things up.

FORALL INX IN 1 .. rowIDs.count
    DELETE FROM EVENT where rowid=rowIDs(INX); 

COMMIT;
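
For contrast, the row-by-row equivalent of the FORALL above would look like the following, incurring one context switch per DELETE, which is exactly the overhead we are avoiding (shown for comparison only):

-- Row-by-row version: one context switch per DELETE.
FOR INX IN 1 .. rowIDs.count LOOP
    DELETE FROM EVENT where rowid=rowIDs(INX);
END LOOP;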

Incremental Commits

Incremental commits are a very important part of the purging process. When performing a large number of deletes, one way to dramatically reduce, if not eliminate, blocking is to perform incremental commits. FORALL does not provide a native way to do this, so we have to be a little creative. This naturally adds a little more complexity to the PL/SQL code, and achieving incremental commits is the whole reason for my “WHILE rowCount” loop and its corresponding logic. One of the keys to preventing blocking on very active tables is to perform tasks such as deletes in smaller quantities per transaction. Based on my experimentation, committing every 100, 500, or 1000 rows seems to perform well. Committing too often adds overhead, while committing too infrequently holds locks for longer. This is another one of the variables that you will want to tweak in your testing to find the most performant approach.
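
If you want to make that knob easy to turn, the 500-row batch size could be promoted to a parameter. The fragment below is a hypothetical variation, not the procedure as written above; the row-limiting clause accepts a PL/SQL variable, so only the SELECT needs to change:

CREATE OR REPLACE PROCEDURE PURGE_OLD_EVENTS(fromTime TIMESTAMP, toTime TIMESTAMP,
                                             maxExecutionMinutes PLS_INTEGER,
                                             batchSize PLS_INTEGER DEFAULT 500)

…

SELECT rowid BULK COLLECT INTO rowIDs
  FROM EVENT
 WHERE EVENT_DT BETWEEN oldestDateTime AND minuteLater
 FETCH FIRST batchSize ROWS ONLY;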

Take a Breath

A DBMS_SESSION.Sleep is performed at the end of processing every time block. In layman's terms, I consider this my way of letting the procedure take a breath, which works wonders for allowing the database to service other requests waiting to get at the table.

maxExecutionMinutes Logic

As stated earlier, this is more of a bonus mechanism that I will often code where it could be useful. It doesn't directly impact performance but is used as a way to directly control how long the procedure runs. Without something like this, and depending on just how massive the backlog of data to be deleted is, you might fire up the procedure and realize that it is going to run a lot longer than expected. If that is the case, you are forced to terminate the procedure execution, which in my experience isn't always clean (especially if you are using Oracle SQL Developer). This logic lets the procedure cleanly stop after a maximum number of minutes, even if it has more to delete. It is not only useful for testing your procedure but is very handy if you want an application to trigger this procedure on a specific interval. Let's say you want to trigger the procedure to run every 10 minutes but want to give it a 1-minute break between executions; you could pass a value of 9 for this parameter. Taking an approach such as this, you could essentially create a continuously running purge job.
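
In my case the scheduling lives in the Spring Boot service, but if you would rather keep everything inside the database, an Oracle Scheduler job can drive the same pattern. The block below is a sketch along those lines; the job name, the 60-to-30-day range, and the 10-minute/9-minute split are all example values:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'PURGE_EVENTS_JOB',   -- example name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN PURGE_OLD_EVENTS(SYSTIMESTAMP - INTERVAL ''60'' DAY,
                                               SYSTIMESTAMP - INTERVAL ''30'' DAY, 9); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=10',   -- fire every 10 minutes
    enabled         => TRUE);
END;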

Testing The Purge Procedure

Testing your purge procedure is key to making it performant, but you cannot test the true performance of your solution without some serious data to test against. Let's populate the table with a bunch of test data in a specific date/time range to simulate the amount of data that might accumulate in our table in production.

Let’s say that we want to insert 250,000 rows of test data that all fall within a single minute of time. We calculate 60 seconds / 250,000 rows, which gives us a timestamp increment of 0.00024 seconds. The following block of PL/SQL will generate these 250,000 rows, spanning from 12:00 AM on January 1, 2022, through 12:01 AM. The timestamps for all the rows will be different, each 0.00024 seconds apart. The variables can be modified to meet your data expectations.

DECLARE
  maxInsertBlocks        INTEGER := 250;    -- number of commit blocks
  maxInsertsPerBlock     INTEGER := 1000;   -- rows inserted per commit
  quickSleepEveryXBlocks INTEGER := 10;

  nextTimeStamp          TIMESTAMP := TO_TIMESTAMP('01-JAN-2022', 'DD-MON-YYYY');
  nextTimeStampIncrement INTERVAL DAY TO SECOND := INTERVAL '0.00024' SECOND;

BEGIN

 FOR currentInsertBlock IN 1..maxInsertBlocks LOOP
   FOR currentInsert IN 1..maxInsertsPerBlock LOOP

    nextTimeStamp := nextTimeStamp + nextTimeStampIncrement;
    INSERT INTO "EVENT" (EVENT_DT, EVENT_TYPE, DESCRIPTION)
    VALUES (nextTimeStamp, 1, 'This is a test');

   END LOOP;

   COMMIT;

   -- Pause briefly every few blocks so other sessions can get a word in.
   IF MOD(currentInsertBlock, quickSleepEveryXBlocks) = 0 THEN
    DBMS_SESSION.Sleep(0.05);
   END IF;

 END LOOP;

END;

On my machine this took about 36 seconds to execute. Having one minute's worth of data that is representative of what we would expect to see in production is nice; however, in order to test deletion efficiency, we probably want more like an hour's worth of data, which is 60x what we just inserted.
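
Before scaling up, it is worth a quick sanity check that the generator produced what we expect; a simple query like this confirms the row count and the time span covered:

-- Quick sanity check on the generated test data.
SELECT COUNT(*) AS total_rows,
       MIN(EVENT_DT) AS oldest_event,
       MAX(EVENT_DT) AS newest_event
  FROM EVENT;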

If we want to expand those 250,000 rows so that we have roughly the same number of rows per minute over the span of an hour, we could add another loop to the code above. However, an hour's worth of rows equates to 15 million rows generated, and with the row-by-row approach that means a significant number of context switches between PL/SQL and the database, making the total execution time substantial.

One way to speed things up is to reduce context switching by utilizing BULK COLLECT and FORALL, just like we did in our PURGE_OLD_EVENTS procedure. Secondly, a fast way to generate the base data that can be BULK COLLECTed and used to generate the rest of our data is to write a peculiar-looking query using the LEVEL pseudocolumn with the CONNECT BY clause against the DUAL dummy table. These LEVEL queries are commonly used as a way to generate a sequence of numbers.
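
If you have not run into this pattern before, here it is on its own; the query simply returns the numbers 1 through 5, one per row:

-- Generates one row per LEVEL value, i.e. the numbers 1 through 5.
SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 5;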

The PL/SQL below dramatically reduces context switching and is highly configurable: you can specify the number of hours to generate data for, how many rows to generate for each minute, the date the timestamps begin at, and how often to commit the inserts. It will generate all the rows you need to simulate an expected volume, with timestamps spread evenly across the entire dataset.

DECLARE
  TYPE rowTSs_t IS TABLE OF TIMESTAMP;
  rowTSs     rowTSs_t;
  beginRange PLS_INTEGER;
  endRange   PLS_INTEGER;

  -- MODIFY THESE VALUES FOR YOUR ROW GENERATION -----
  hoursToGenerate PLS_INTEGER := 1;
  rowsPerMinute   PLS_INTEGER := 25000;
  startDT         DATE        := DATE '2022-01-01';
  rowsPerCommit   PLS_INTEGER := 1000;
  -- -------------------------------------------------

BEGIN

  -- Build one minute's worth of evenly spaced timestamps using a CONNECT BY LEVEL row generator.
  SELECT DT BULK COLLECT INTO rowTSs
    FROM (
        SELECT CAST(startDT AS TIMESTAMP)
               + ((INTERVAL '60' SECOND / rowsPerMinute) * ROWNUM) AS DT
          FROM (SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= rowsPerMinute)
       ) TestData;

  FOR currentHour IN 0..hoursToGenerate-1 LOOP
    FOR currentMinute IN 0..59 LOOP

      -- Insert the minute's timestamps in chunks of rowsPerCommit, shifted to the current hour/minute.
      beginRange := 1;

      WHILE beginRange <= rowTSs.count LOOP
        endRange := beginRange + rowsPerCommit - 1;

        IF endRange > rowTSs.count THEN
          endRange := rowTSs.count;
        END IF;

        FORALL INX IN beginRange .. endRange
          INSERT /*+ APPEND_VALUES */ INTO "EVENT" (EVENT_DT, EVENT_TYPE, DESCRIPTION)
          VALUES (rowTSs(INX)
                    + (INTERVAL '1' MINUTE * currentMinute)
                    + (INTERVAL '1' HOUR * currentHour),
                  1, 'This is a test');

        COMMIT;

        beginRange := endRange + 1;
      END LOOP;

      -- Take a breath between minutes so other work can proceed.
      DBMS_SESSION.Sleep(0.05);
    END LOOP;

  END LOOP;

END;

Be patient, as inserting a large load of test rows isn't going to be instantaneous, especially when generating rows in the tens of millions. This is where much of the tedium of tweaking your purge process for optimal performance comes into play.

Now, if you really want to put your purging procedure through the wringer (which I highly recommend), you can run this same test data generation block at the same time the purging procedure is running and see how well things stand up.
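
While both are running, you can keep an eye on contention and progress from another session. The first query below (which requires access to V$SESSION) lists any sessions currently blocked by another session; the second shows the oldest remaining event, which should steadily move forward if the purge is keeping up:

-- Any rows returned here indicate sessions currently blocked by another session.
SELECT sid, event, blocking_session
  FROM v$session
 WHERE blocking_session IS NOT NULL;

-- Watch the oldest remaining event advance as the purge drains the table.
SELECT MIN(EVENT_DT) FROM EVENT;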

Summary

The final part of the solution is determining the trigger by which the procedure is ultimately executed. This of course can be any of a number of possibilities available in your environment's arsenal. In my case, the trigger point was a small Java Spring Boot service, configured to call the procedure at specific times, passing in calculated date ranges.

As demonstrated above, there really is an art to data purging in a high-volume, high-velocity system, and you have to identify the “sweet spot” to make your solution performant. A good solution doesn't mean that it purges data at an insane rate; rather, the ultimate goal is to drain the pond faster than it is filling. When you are draining faster than you are accumulating, eventually you will win. The question is, how long will it take to win? That is determined by just how deep the pond is at the point at which the draining process begins, but more importantly, by how well you have tweaked things according to what you know about your data. I hope that I have given you some food for thought, and perhaps the solution provided here can serve as a building block for your own needs.