Thursday, June 28, 2012

Snapshot Too Old after SAN outrage

Recently we experienced an SAN outrage. The databases was brought up after SAN was fixed. However we received an ORA-01555 error “Snapshot too old” when we run a SQL statement, e.g.
select * 
  from scott.orders 
 where order_date between trunc(sysdate - 1) and trunc(sysdate);
ORA-01555 is caused by Oracle read consistency mechanism. When the long run query is not able to find the before-image from undo segment, ORA-01555 occurs. But I was very sure it is not our case, since the application server was still down and there were sessions only from dba. And I also noticed the SQL statement was successful without ORA-01555 if
select * 
  from scott.orders 
 where order_date between trunc(sysdate - 2) and trunc(sysdate - 1); 
It makes me think there is data block corruption for the data in the time range of trunc(sysdate - 1) and trunc(sysdate).


Oracle Metalink document 787004.1 describes a very similar situation, except that we have xml data type. So I continued the investigation.

select order_id 
  from scott.orders 
 where order_date between trunc(sysdate - 1) and trunc(sysdate);
select order_xml 
  from scott.orders 
 where order_date between trunc(sysdate - 1) and trunc(sysdate);

The first SQL was successful. The second got the same ORA-01555 error. Now it is confirmed order_xml column was corrupted.


I wrote a script to find out the corrupt data block,
DECLARE
   n   NUMBER;
   s   VARCHAR2 (4000);
BEGIN
   FOR c1 IN (SELECT *
                FROM scott.orders
               WHERE empno)
   LOOP
      BEGIN
         s := XMLType.getClobVal (c1.order_xml);
      EXCEPTION
         WHEN OTHERS
         THEN
            DBMS_OUTPUT.put_line ('ORA' || SQLCODE || ': order_id = ' || c1.order_id);
      END;
   END LOOP;
END;