AVOID the VOID

I've recently been involved with extending a number of systems that have pre-built data models. I'm generally unhappy with these data models for a variety of reasons. There are many great academic texts on data modeling. I will try to put together a bibliography in an upcoming post. For now, I'll start by discussing the "never delete data" trend. It is generally coupled with the use of a column to indicate that the data should have been deleted (typically a column named VOID) but was instead allowed, indeed required, to linger forever in the table.

There are typically two arguments in favor of the "never delete, add a VOID column" data model: I want to know what happened from a traceability perspective, and, I want to be able to do incremental extracts to populate some other system and need to know if I need to VOID the row in the other system.

Example without VOID

It's easiest to deal with a concrete example, so let's make one. Assume we have an employee table that stores data about employees. For the purposes of my argument (and because it probably makes sense) let us assume we require a unique SSN for each employee. Typically this table would look like this:

CREATE table "EMPLOYEE" (
"ID" NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY,
"SSN" VARCHAR2(32) NOT NULL,
"LAST_NAME" VARCHAR2(128) NOT NULL,
"FIRST_NAME" VARCHAR2(128),
"SALARY" NUMBER,
"OTHER_INFO" VARCHAR2(256),
"LAST_UPDATED" DATE NOT NULL,
constraint "EMPLOYEE_PK" primary key ("ID")
)
/

alter table "EMPLOYEE" add
constraint "EMPLOYEE_SSN_UK"
unique ("SSN")
/

Because we want to be able to do incremental updates, we need the LAST_UPDATED column to be not null and we need to ensure it is always set correctly. There are many reasons to avoid triggers; just do an internet search for "Tom Kyte triggers" to see a number of valid arguments. For this purpose, though, I will add a trigger:

create or replace trigger "EMPLOYEE_BRIUT"
BEFORE
insert or update on "EMPLOYEE"
for each row
begin

:new.last_updated := sysdate;

end;

Note that I have told the database that SSN will be unique by adding EMPLOYEE_SSN_UK.
The database will automatically also create an unique index of the same name.

Let's explore what happens if two users attempt to insert employees with the same SSN.

User A (note lack of commit):
insert into employee (ssn, last_name)
values ('123456789', 'Smith');

1 rows inserted.

User B:
insert into employee (ssn, last_name)
values ('123456789', 'Smith');

....waiting....

User A:

commit;

committed.

User B:

Error starting at line : 1 in command -
insert into employee (ssn, last_name)
values ('123456789', 'Smith')
Error report -
SQL Error: ORA-00001: unique constraint (ANTON.EMPLOYEE_SSN_UK) violated
00001. 00000 - "unique constraint (%s.%s) violated"
*Cause: An UPDATE or INSERT statement attempted to insert a duplicate key.
For Trusted Oracle configured in DBMS MAC mode, you may see
this message if a duplicate entry exists at a different level.
*Action: Either remove the unique restriction or do not insert the key.

The database recognized that the SSN was a duplicate and did not allow User B to insert. Had User A issued a rollback, User B could have continued, but once User A had committed the record, User B received an error.

This functionality was completed with a single line of code:

alter table "EMPLOYEE" add constraint "EMPLOYEE_SSN_UK" unique ("SSN")

If I were coding an API and wanted to capture the error, it would require one additional line of code (assuming you already have the keyword EXCEPTION in your API):

exception
when DUP_VAL_ON_INDEX then ... do something

Later you realize that this person NEVER should have been entered into the employee table. This was not an employee, it was a customer. You issue the following command:

delete from employee
where ssn = '123456789';

commit;

Later still, this customer becomes an employee. You issue the following:

insert into employee (ssn, last_name)
values ('123456789', 'Smith');

commit;

This works just fine and no additional code is required.

Example With VOID

Let us assume that someone convinces you to disallow any deletes. Instead you are asked to add a VOID column. The VOID column will contain a V if the record is "void," else it will be null.

CREATE table "EMPLOYEE_NO_DELETE" (
"ID" NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY,
"SSN" VARCHAR2(32) NOT NULL,
"LAST_NAME" VARCHAR2(128) NOT NULL,
"FIRST_NAME" VARCHAR2(128),
"SALARY" NUMBER,
"OTHER_INFO" VARCHAR2(256),
"VOID" VARCHAR2(1),
"LAST_UPDATED" DATE NOT NULL,
constraint "EMPLOYEE_ND_PK" primary key ("ID")
)
/

create or replace trigger "EMPLOYEE_ND_BRIUT"
BEFORE
insert or update on "EMPLOYEE_NO_DELETE"
for each row
begin

:new.last_updated := sysdate;

end;
/

Given the scenario listed above, we won't be able to add the unique constraint on SSN. If we were to do so, we would not be able to add them employee the second time as there would already be an employee record with that same SSN. Perhaps we could get away with making SSN + VOID unique.

alter table "EMPLOYEE_NO_DELETE" add constraint "EMPLOYEE_ND_SSN_V_UK" unique ("SSN", VOID)
/

That seems to do the trick.

insert into employee_no_delete (ssn, last_name)
values ('123456789', 'Smith');

1 rows inserted.

update employee_no_delete
set void = 'V'
where ssn = '123456789';
1 rows updated.
insert into employee_no_delete (ssn, last_name)
values ('123456789', 'Smith');
1 rows inserted.
commit;

We still have all of the great features around row locking on uniqueness provided by the database.

Of course, if your users are anything like mine, you will find that Mr. Smith has once again been added as employee but he is really a customer. So...

update employee_no_delete
set void = 'V'
where ssn = '123456789';

Ah, but here we get

Error starting at line : 1 in command -
update employee_no_delete
set void = 'V'
where ssn = '123456789'
Error report -
SQL Error: ORA-00001: unique constraint (ANTON.EMPLOYEE_ND_SSN_V_UK) violated
00001. 00000 - "unique constraint (%s.%s) violated"
*Cause: An UPDATE or INSERT statement attempted to insert a duplicate key.
For Trusted Oracle configured in DBMS MAC mode, you may see
this message if a duplicate entry exists at a different level.
*Action: Either remove the unique restriction or do not insert the key.

We already have a row with this combination of SSN and VOID. There is no choice but to remove the constraint, and, along with it, all of the multi-user concurrency features provided by the database.

alter table "EMPLOYEE_NO_DELETE" drop constraint
"EMPLOYEE_ND_SSN_V_UK"
/

By dropping the constraint we lose more than we can possibly recover by adding our own code. Nevertheless, let's try.

Now we have to add code to ensure that two users (in two separate sessions) never insert or update rows to cause a duplicate SSN. That means we must insist that all updates happen through an API. You might argue that only updates that include the SSN must go through the API. There are edge cases where this could cause deadlocks--and more importantly, it would be difficult to allow updates to everything except SSN. Hence, we have a new rule: All updates must go through our API. This rule also means that, unless we code special additional APIs, all processing is row by row (AKA slow by slow). Want to give everyone a 10% raise? That means updating each row rather than issuing a single update. Our API is also somewhat complicated. We must ensure that there is only one insert or update that involves SSN at a time--across sessions. As we don't have much insight into the happenings of another session, we'll need some way to track this. In order to serialize the inserts and any updates that might change the SSN, we must lock the EMPLOYEE_NO_DELETE table--the whole table. This means before each insert or update we must issue

lock table employee_no_delete in share mode nowait;

We might consider using WAIT instead of NOWAIT, especially as we assume that there will be a lot of table locks.

lock table employee_no_delete in share mode wait 10;

Alternative Method

I've seen this implemented manually by creating another table that tracks table names--and then the API must lock the appropriate row.

CREATE table TABLE_WITH_VOID_COLUMN (
TABLE_NAME VARCHAR2(256),
constraint "TABLE_NAME_PK" primary key ("TABLE_NAME")
)
/

insert into table_with_void_column (table_name)
values ('EMPLOYEE_NO_DELETE');

commit;

We would never actually update this row, but would lock it in order to interact between sessions.

That method involves creating a new table and code to accomplish something Oracle already provides. Obviously, that is something we are already attempting by adding the VOID column and coding around it, so I'm not surprised to see custom table locking implemented by the same folks who implement custom delete handling.

Back to the API

In order to ensure that a row is never deleted and that we never have a duplicate SSN, we need an API such as the one below.

create or replace package employee_ND_api is

procedure ins ( p_emp_rec in employee_no_delete%rowtype) ;

end employee_ND_api;
/

create or replace package body employee_ND_api is

procedure ins ( p_emp_rec in employee_no_delete%rowtype) is

duplicate_ssn exception;
l_count number;

begin

lock table employee_no_delete in share mode wait 10;
begin

select 1 into l\_count  
  from dual  
  where exists (select 1 from employee\_no\_delete e  
    where e.ssn = p\_emp\_rec.ssn);  

-- oops we found a row already there  
raise duplicate\_ssn;  

exception when no_data_found then null; -- ok to continue
insert into employee_no_delete values p_emp_rec;
end;

-- note: We can not commit. There should be a full transaction,
-- actions before and after this action, that need to be
-- committed together. Hence, we lock the row in
-- table_with_void_column until the whole transaction
-- completes.
end ins;

end employee_ND_api;
/

The update routine would be slightly more complicated as we must also lock the row we intend to update, but the INS routine above points to some issues already. We have now locked the entire EMPLOYEE_NO_DELETE table. As noted, we can't commit the newly inserted record as there may be other DML that needs to occur--inserts or updates to other rows or data in other tables. Hence, the entire table remains locked until the final commit or rollback. No other session can insert or update any row of EMPLOYEE_NO_DELETE until we complete.

Moreover, there may be many tables--perhaps all tables--in our system with this same requirement. Hence, when we attempt to update data in another table in our unified transaction, we will need to take the same approach--lock the entire table. Unless every transaction in our system always follows the same order, we will certainly run in to deadlocks on a frequent basis: one session will lock EMPLOYEE_NO_DELETE, another will lock DEPT_NO_DELETE, the first will attempt to lock DEPT_NO_DELETE but be blocked. Then the second will attempt to lock EMPLOYEE_NO_DELETE and the database will detect a deadlock--forcing a rollback of one of the sessions. There is no way to avoid this.

The Incremental Update Requirement

Clearly EMPLOYEE_NO_DELETE, with its void column, has problems. The requirement to do incremental updates of another system, though, remains. If we return to the EMPLOYEE table and allow the row to be deleted using the EMPLOYEE table rather than setting VOID = 'V' in the EMPLOYEE_NO_DELETE table, how does the incremental update routine know to remove (or void) the row?

This is trivial with the use of a trigger on the EMPLOYEE table. Whenever a delete occurs, write a row to another table to indicate the delete. I'll mention Tom Kyte's dislike of triggers here again. I generally agree with Tom on this point. I don't even like the trigger I used above to populate the last_updated column. In the case of audit tables, though, I think a trigger is absolutely warranted. We are not changing any data in the base table and there are no foreign keys or even constraints on the audit table. Users would only ever be granted SELECT on that table. This is the place for a trigger. Depending upon our audit requirements, we might just indicated who took the action, when and whether it was an insert, update or delete. If we really need traceability, though, it's easy to capture the whole image of the row. I'll do that for this example:

-- create the audit table

CREATE table "EMPLOYEE_AUDIT" (
"ID" NUMBER,
"SSN" VARCHAR2(32) NOT NULL,
"LAST_NAME" VARCHAR2(128) NOT NULL,
"FIRST_NAME" VARCHAR2(128),
"SALARY" NUMBER,
"OTHER_INFO" VARCHAR2(256),
"LAST_UPDATED" DATE NOT NULL,
"ROW_ACTION" VARCHAR2(32)
)
/

-- create the trigger

create or replace trigger "EMPLOYEE_ARIUT"
AFTER
insert or update or delete on "EMPLOYEE"
for each row
declare
l_action varchar2(32);
begin

if inserting then l_action := 'INSERT';
elsif updating then l_action := 'UPDATE';
else l_action := delete;
insert into employee_audit (id, ssn, last_name, first_name
, salary, other_info, last_updated, row_action)
values (:new.id, :new.ssn, :new.last_name, :new.first_name
, :new.salary, :new.other_info, :new.last_updated, l_action);

end;
/

The incremental routine can simply query the deleted row to gather the data. With sufficient data in the audit table, we can create a view that looks exactly like EMPLOYEE_NO_DELETE, but without its inherent shortcomings (nay, fatal flaws).

--create a view that has the deleted row
create or replace view employee_with_void
select id, ssn, last_name, first_name, salary
, other_info, last_updated, null void
from employee
union
select id, ssn, last_name, first_name, salary
, other_info, last_updated, 'V' void
from employee_audit
where row_action = 'DELETE'
/

The audit table can provide much better information if we need it. With just a LAST_UPDATED column (and no audit table), the incremental routine would never know about multiple changes that occur between the incremental runs. It may not need to, but if it does, the audit table provides that ability.

In fact (not supposition, but absolute fact) the right method is to allow the row to be deleted. This provides true data protection, performs better and requires far less code which is also far less complicated.

Still Unconvinced?

OK, maybe you don't have the requirement for any unique constraints. First, I don't buy that argument for well over 90% of tables. If you are doing incremental updates in another system, you need some way to identify where to apply the incremental update--that would be the logical unique key. But, for the sake of argument, we will assume that you don't have any unique keys. That means that you don't need to be concerned about multi-user concurrency issues as much. What about that VOID column, though? Will you allow a row to be "unvoided?" Will you allow any updates to a voided record? If not, you would definitely need an API to keep that from happening. Given every possible argument and every leniency of requirement, it will still be less complicated for developers and users of the data to AVOID the VOID. For every query implemented by every user or developer save the incremental update routine the query must contain

where void is null

Indexes may need to include the VOID column--you'll need to do the analysis on a table by table basis.

You definitely need an API and likely slow by slow processing everywhere.

And here is the worst part, even the incremental load process is more complicated--if you want to get it right. If you just have two columns for audit/delete purposes, LAST_UPDATED and VOID, you lack the fidelity of knowing a row ever existed or what its incremental states were. Take this scenario, for example:

midnight: incremental process runs
9:00 am User adds employee Smith, SSN = 123456789
9:15 am User updates Smith to Jones, SSN = 123459999
10:00 am Payroll runs and pays Jones $500
noon User updates Jones to Smith, SSN = 123456789
1:00 pm User voids the record
midnight: incremental process runs

All that the last incremental process sees is that there was a row for Smith, SSN = 123456789 and that it was voided. What is the incremental process expected to do with that information?

Need I mention foreign keys? If you have either a parent or a child record, how do you handle the relationship? Obviously a parent record can't be deleted; it would have to be voided. All child records would have to be voided as well. The cascading all has to be coded for--not to mention the locking of all of the cascading. At this point I have to ask: why did your company spend so much money on an Oracle database? I doubt it was to hire developers to code the same features again--with less functionality.

Adding a table is easy. Creating a view is easy. All subsequent code benefits from these constructs. Why does the VOID persist? If anyone has a reason for a VOID column, let me know in the comments. Until then, please join with me in this movement--AVOID the VOID.