
Michael Zilberstein

Corruption case

Recently I had to take care of the most interesting corruption case I've ever seen, so I decided to share the experience with you. We're talking about a small accounting program that keeps its data in SQL Server Express – in this particular case, SQL Server 2005. The customer called today and sent me the following error screen (nice screenshot – taken with a cell phone camera :-)):

[Screenshot of the error dialog]

Upon connecting to the server, I immediately noticed dumps with the same error. Here is the entire error message:

A time-out occurred while waiting for buffer latch -- type 2, bp 04268450, page 1:804, stat 0xc00009, database id: 5, allocation unit Id: 72057594108248064, task 0x00A186B8 : 0, waittime 300, flags 0x1a, owning task 0x00A0A4D8. Not continuing to wait.
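The message itself tells us where to look: database id 5, page 1:804, and an allocation unit id. A quick sketch for decoding those fields (DBCC PAGE is undocumented; the literals below come straight from the error above):

-- Which database is id 5?
SELECT DB_NAME(5) AS database_name;

-- Map the allocation unit id back to an object; run this inside the affected database.
-- (container_id = hobt_id holds for in-row and row-overflow allocation units.)
SELECT OBJECT_NAME(p.object_id) AS object_name, p.index_id, au.type_desc
FROM sys.allocation_units AS au
JOIN sys.partitions AS p ON au.container_id = p.hobt_id
WHERE au.allocation_unit_id = 72057594108248064;

-- Dump the page itself: db id 5, file 1, page 804, detail level 3.
DBCC TRACEON (3604);
DBCC PAGE (5, 1, 804, 3);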

This article was extremely helpful in analyzing the dump with the WinDbg tool – the flow and the somewhat cryptic commands described there easily pinpointed the guilty thread and its call stack:

sqlservr!LatchBase::AcquireInternal
sqlservr!BUF::AcquireLatch
sqlservr!BPool::Get
sqlservr!PageRef::Fix
sqlservr!IndexPageManager::GetPageForLinkModification
sqlservr!RemoveBTreePageIfUnchangedInternal
sqlservr!RemoveBTreePageIfUnchanged
sqlservr!CleanVersionsOnBTreePage
sqlservr!IndexDataSetSession::CleanupVersionsOnPage
sqlservr!GhostExorciser::CleanupPage
sqlservr!TaskGhostCleanup::ProcessTskPkt
sqlservr!GhostRecordCleanupTask
sqlservr!CGhostCleanupTask::ProcessTskPkt
sqlservr!TaskReqPktTimer::ExecuteTask
sqlservr!OnDemandTaskContext::ProcessTskPkt
sqlservr!SystemTaskContext::ExecuteFunc
sqlservr!SystemTaskEntryPoint
sqlservr!OnDemandTaskContext::FuncEntryPoint
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!_callthreadstartex
msvcr80!_threadstartex

The ghost cleanup frames in the middle of the call stack (GhostExorciser::CleanupPage, TaskGhostCleanup::ProcessTskPkt and friends) indicate that the Ghost Cleanup process caused this failure. Actually, I had a similar encounter with a stuck ghost cleanup several years ago. The easiest way to verify that it is still stuck is to execute the DBCC CHECKDB command. Indeed, CHECKDB appeared to be blocked by the ghost cleanup session.
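You can also catch it red-handed in the DMVs – a sketch, assuming the task surfaces as the GHOST CLEANUP command the way it normally does:

-- The ghost cleanup system task and what it is waiting on.
SELECT r.session_id, r.command, r.wait_type, r.wait_time
FROM sys.dm_exec_requests AS r
WHERE r.command = N'GHOST CLEANUP';

-- Everything queued up behind it, e.g. the blocked DBCC CHECKDB session.
SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.blocking_session_id
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.blocking_session_id IS NOT NULL;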

What's next? You can't kill a system session. Indeed you can't, but you can start SQL Server without it – using trace flag 661 as a startup parameter (don't forget to remove the flag and restart the service again once you're done). So, after restarting the service, rebuilding the index in question, removing the trace flag and restarting the service once more, I already thought that I'd earned my beer – the database was fixed.
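For reference, the whole dance goes roughly like this (the service, database and table names below are placeholders, not the customer's real ones):

rem Stop the instance and start it with TF 661 (ghost cleanup disabled).
net stop MSSQL$InstanceName
net start MSSQL$InstanceName /T661

rem Rebuild the index the ghost cleanup task was stuck on.
sqlcmd -S .\InstanceName -d TheDatabase -Q "ALTER INDEX ALL ON dbo.TheTable REBUILD"

rem Drop the flag by simply restarting the service normally.
net stop MSSQL$InstanceName
net start MSSQL$InstanceName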

Not so fast. Now DBCC CHECKDB managed to complete, but the results were very, very red, including interesting messages like:

Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=1) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=2) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=3) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,index_id=0) in sys.indexes does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3855, State 1: Attribute (data_space_id=1) exists without a row (object_id=1575689407,index_id=0) in sys.indexes.

Nice, huh? After the first wave of shock had passed, I checked and found that an object with that id indeed doesn't exist. So it seems that all we need to do is delete 3 rows from sys.columns and 1 row from sys.indexes. Ah, but those sys.something objects are views, aren't they? So what are the real objects? The way to find the real – internal – tables and columns is via execution plans:

[Screenshot: execution plans revealing the internal base tables behind sys.columns and sys.indexes]
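The trick itself fits in a few lines – a sketch, with the object id taken from the CHECKDB output above:

-- Confirm the object really has no owner.
SELECT OBJECT_NAME(1575689407);                          -- NULL
SELECT * FROM sys.objects WHERE object_id = 1575689407;  -- no rows

-- Let the optimizer expose the internal tables behind the catalog views.
SET SHOWPLAN_ALL ON;
GO
SELECT * FROM sys.columns WHERE object_id = 1575689407;
SELECT * FROM sys.indexes WHERE object_id = 1575689407;
GO
SET SHOWPLAN_ALL OFF;
GO
-- The plans scan sys.syscolpars and sys.sysidxstats instead of the views.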

So actually we need to delete rows from sys.syscolpars and from sys.sysidxstats (notice that the column names are different there too). How do we do it? Let's try the DAC (Dedicated Admin Connection)? No way – Express Edition doesn't support the DAC. Unless… unless we use trace flag 7806 as a startup parameter.

Restart the server again, connect using the DAC, try to delete the rows… Oops:

Msg 259, Level 16, State 1, Line 1
Ad hoc updates to system catalogs are not allowed.

But for this we have Paul Randal's instructions. So: "sqlservr.exe -sInstanceName -m -T661 -T7806", then "sqlcmd -S.\InstanceName /A" and finally…

DELETE FROM sys.syscolpars WHERE id = 1575689407 AND number = 0
GO
DELETE FROM sys.sysidxstats WHERE id = 1575689407 AND indid = 0
GO

Now stop the server, remove all the trace flags, start the server the normal way and verify that DBCC CHECKDB returns nothing. Bingo! And… a well-deserved glass of my own home-brewed Scottish Ale!
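The final check is the usual one-liner (the database name is a placeholder):

DBCC CHECKDB (N'TheDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS;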

[Photo: the home-brewed Scottish Ale]

Published Friday, March 22, 2013 3:49 AM by Michael Zilberstein


Comments

 

Paul S. Randal said:

Nicely done!

March 21, 2013 8:27 PM
 

Pablo Montilla said:

Mind boggling! Excellent work.

March 21, 2013 8:42 PM
 

Michael Zilberstein said:

Thanks, Paul and Pablo!

March 24, 2013 8:29 AM
 

Yaniv said:

Cool, nice post and excellent job done.

March 24, 2013 9:00 AM
 

JRStern said:

Very nice – and if it happens to me, super-nice that I can refer to it – but isn't there still the question of what happened to cause this?

March 26, 2013 10:52 PM
 

Michael Zilberstein said:

JRStern,

I've seen this sort of corruption only for the second time in almost 13 years. Take into account that this small bookkeeping program usually runs on desktops and not in high-tech companies – not all of its users know what a UPS is. And SQL Server is extremely vulnerable to power outages. You never know what corruption will happen – will it be the master database, or some index in a user database, or metadata in a user database, or an inconsistent LSN between the data file and the log – I've seen all of the above and much more in the 4 or 5 years I've been working with the company that makes that bookkeeping program. In my case it was Ghost Cleanup that was stuck on a particular page – it couldn't proceed and didn't let anyone else touch that page or the entire database (for example, DBCC CHECKDB failed instantly with an error message that it couldn't create a database snapshot). The problem was that all the available backups contained the corrupted page, so instead of restoring I had to find a way to fix it. But otherwise it would have been much more boring :-).

March 27, 2013 10:02 AM
 

Karthick P.K said:

Very nice article, Michael. Just check whether DBCC CHECKDB reports "The system catalog was updated directly in database". If so, move the data to a new database to stay in a supported environment.

March 31, 2013 6:46 PM
 

Michael Zilberstein said:

Hi Karthick,

First, thanks for your blog – I found it while working on this case and have since been reading it all, bottom up. Great stuff that nobody else reveals.

Regarding CHECKDB – when I executed it with ALL_ERRORMSGS, NO_INFOMSGS, it came back clean. I don't have access to the environment right now, but if it is an informational message then I guess I missed it because of NO_INFOMSGS and will have to execute it again without that option.

April 3, 2013 4:15 AM
