I always thought I had good Microsoft Exchange skills. It’s bread and butter really: installations, migrations and upgrades, with a bit of support here and there. I know what I need to know and know it pretty well. I’d always assumed that there was a safety net for any IT guy, and I really thought that would be manufacturer support, until I had cause to cash in a Microsoft support credit this week. Not only did I find that the safety net didn’t exist, I also found my skills were beyond those of the people Microsoft use, and the Microsoft people don’t even work for Microsoft!…
I’ll warn you now, the feral IT Manager in me is well and truly out in this article and I’m going to talk tech.
This is my Exchange-mare before Christmas…
There are a few things that scare me in this world of IT, and SQL and Exchange database corruption are up there. Now I can probably feel a few not-so-techy IT guys out there saying “backups”. I’d firstly like to say, you’re idiots. Then I should probably also say, yeah, it would have been good.
Backups are great, but they’re a snapshot in time, maybe 9pm at night, unless you’re using a continual protection backup tool. Those are usually very intensive and resource hungry, and few of them work well. There are also ways of building a resilient Exchange architecture, but they’re expensive.
The crash that ruined my week happened at 2pm on Monday.
Monday morning about 8.30am
One user called to say he’d had an issue connecting to the mail server. I quickly rebooted it and the problem went away. Don’t judge me IT guys, we all do it.
Monday afternoon about 3pm
Calls were in from a few users about a mail issue. I thought it was just a repeat of the morning issue and issued a reboot, but it didn’t fix it.
I’d had issues like this before, so drawing on that experience I fired up the Exchange PowerShell and got cracking. I knew the database was in dirty shutdown mode (fnar), dismounted the database (fnar again) and used ESEUTIL to repair and defragment it, which took hours. Afterwards I was left reading a guide for Exchange 2007 on Spiceworks which talked about using ISInteg to check the integrity of mailboxes, a tool which does not exist in the new version. Even without it, I have successfully fixed a database by this method before and remounted it fine.
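For anyone following along, the dirty shutdown check and offline repair look roughly like this from the Exchange Management Shell. This is a sketch only: the database name DB01 and the file path are placeholders I’ve made up, and /p is a hard repair that can discard damaged data, so copy the .edb file somewhere safe first.

```powershell
# Placeholder name and path, substitute your own.
$db  = "DB01"
$edb = "D:\ExchangeDatabases\DB01\DB01.edb"

# Take the database offline before touching the file.
Dismount-Database -Identity $db -Confirm:$false

# Dump the database header; the "State:" line shows
# Dirty Shutdown or Clean Shutdown.
eseutil /mh $edb

# Hard repair (last resort, may drop damaged pages), then offline defrag.
eseutil /p $edb
eseutil /d $edb

# Bring the database back online.
Mount-Database -Identity $db
```

Both the repair and the defrag need plenty of free disk space and, on a large database, plenty of time.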
So my database had remounted fine, my own mailbox had connected and mails were firing through. Job done.
Beep beep, 7am text messages in.
I still had reports of users not able to get mail, and I could see that the server connection was going on and off-line. I had to start investigating again.
There was no way I was going to roll back to the backup, as a) it was now too old and b) I had no idea if it would restore and reintegrate back into the environment correctly without testing. I’d also assumed that the system was largely fixed and allowed half a day of emails into the system. I really wasn’t sure how the up-to-date PSTs would like a mailbox database that was a couple of days old, not to mention the bother I’d be in with the users.
Exchange 2013 has an online integrity checker that you can start by running New-MailboxRepairRequest -Database <DatabaseID> -CorruptionType SearchFolder,AggregateCounts,ProvisionedFolder,FolderView. This bad boy should scan through your database like ISInteg used to do, except that ISInteg would have required your mailboxes to be offline.
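Written out in full, a repair run might look something like the sketch below. DB01 and some.user are made-up placeholders, and the last line assumes your build includes Get-MailboxRepairRequest; if it doesn’t, the Application event log on the mailbox server is where the results turn up.

```powershell
# "DB01" and "some.user" are placeholders, substitute your own.
$types = "SearchFolder","AggregateCounts","ProvisionedFolder","FolderView"

# Report-only pass against one mailbox: detects corruption without fixing it.
New-MailboxRepairRequest -Mailbox "some.user" -CorruptionType $types -DetectOnly

# Full repair pass across every mailbox in the database.
New-MailboxRepairRequest -Database "DB01" -CorruptionType $types

# Check on the requests (assumes this cmdlet is available on your build).
Get-MailboxRepairRequest -Database "DB01" | Format-List
```

Trying -DetectOnly on a single mailbox first is a cheap way to see whether the request runs at all before queuing the whole database.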
My problem was that it was failing for two corruption types, always the same ones, FolderView and ProvisionedFolder, and it kicked an error back into the event log, which led me to an article here, which at the time had no answer.
I also had some issues kicking about in the Exchange Server that turned out to be nothing to do with the crash and the problems I was having. They were creating log entries and were very possibly the reason the repair request wasn’t running, so I worked to clear those. They were to do with the Arbitration accounts that Exchange uses in the background for different jobs, including mailbox moves (which I had a feeling were coming). One of my accounts was having problems because it had been installed on a default mailbox database that I later disabled, although I thought I’d successfully moved the account.
Although I had the DB up and running, I had a degree of ‘fluttering’ going on, with the mailbox database dismounting unexpectedly and remounting, disrupting mail to all users. I also couldn’t get mail through to about 10 users, the MD being one of them (typically), as well as some of the busy senior managers, many with large mailboxes.
I had a very good report from one user who told me that she was trying to get mail through to a couple of users and getting a bounce saying Quarantine. Finally I had a clue what was going on.
Things had settled down, the fluttering seemed to have stopped but I couldn’t get mail through to a number of users and was getting various bounces, quarantine and unexpected error.
I cleared the quarantine, re-ran ESEUTIL as I thought it hadn’t been successful and after some time brought everything back online. The fluttering started and boxes returned to quarantine. I still couldn’t run New-MailboxRepairRequest successfully.
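For reference, finding and releasing quarantined mailboxes can be done from the shell along these lines. It’s a rough sketch with placeholder names (DB01, some.user), not a record of exactly what I typed.

```powershell
# "DB01" is a placeholder database name.
# List the mailboxes the store has quarantined.
Get-MailboxStatistics -Database "DB01" |
    Where-Object { $_.IsQuarantined } |
    Format-Table DisplayName, IsQuarantined

# Release one mailbox (placeholder alias). If the underlying corruption
# is still there, the store may simply quarantine it again.
Disable-MailboxQuarantine -Identity "some.user"
```

That last comment is the catch: releasing a mailbox treats the symptom, not whatever made the store quarantine it in the first place.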
It was 10pm. An all-nighter was coming.
First I picked up the phone to Dell. The server runs on Dell hardware and I vaguely remembered a conversation where I got some help with an application. I thought it had to be OEM, but I half remembered being told it could be anything with Pro Support for IT. Alas I was wrong, but the Dell guys, to be fair, had a go, listened to my problem and said they were out: it was beyond their skills without me having to pay.
I had left it too late to call my IT guy friends who would have had a few words of wisdom.
Anyway, I remembered I had an MS 24/7 incident I could use with Software Assurance. Luckily, I figured out how to raise it all online and someone did call back relatively quickly. I raised the request with the same question about New-MailboxRepairRequest as in this article, but also described the symptoms I was having with the mailboxes.
It was an Indian call centre that called… okay everyone knows my thoughts on those, but this is Microsoft, I’m expecting great things. These guys should be the best!
The guy was good to be fair – however in my heart of hearts I knew he was only about as good as me – and I needed better. I probably could have done the same work migrating faulty mailboxes quicker without waiting for someone to take his time diagnosing things. It’s always disheartening when the Microsoft engineer is clearly pasting commands from the same websites I’d just read too!
Ultimately, I knew migrating faulty mailboxes was my next step, as I’d already created a new database on a new VMware disk to do just that before going on the call. A lot of people in the forums are saying that in the new version of Exchange, without ISInteg, migration is the most solid way of having a mailbox checked out.
However, my question remained: why were the FolderView and ProvisionedFolder checks always failing? The answer I got was ‘corruption’. I thought that was odd and pointed out it was happening to mailboxes that were working fine. Alas, I got that pregnant silence you get with Indian call centres when you’re not sure whether you’ve not been understood or they just don’t want to answer and don’t want to tell you that they don’t know.
Anyway, I left it with the guy as it was getting very late. I wasn’t convinced the mailboxes we’d taken out of quarantine had come back up properly, because New-MailboxRepairRequest hadn’t worked correctly on them. I said to him that the test would be in the morning when the users came in, and that I’d need help then. Bedtime was about 2am, and I didn’t sleep well.
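A mailbox move to the new database looks roughly like this. The names are placeholders (some.user for the faulty mailbox, DB02 for the new database) and the bad item limit of 10 is an arbitrary figure for illustration, not a recommendation.

```powershell
# "some.user" and "DB02" are placeholders for the faulty mailbox
# and the freshly created database.
# BadItemLimit lets the move skip a few corrupt items instead of failing outright.
New-MoveRequest -Identity "some.user" -TargetDatabase "DB02" -BadItemLimit 10

# Watch progress and see how many bad items were skipped along the way.
Get-MoveRequest -Identity "some.user" | Get-MoveRequestStatistics |
    Format-List DisplayName, Status, PercentComplete, BadItemsEncountered
```

The move copies the mailbox item by item into the new database, which is why people treat it as a stand-in for an integrity check: anything too broken to copy gets counted as a bad item rather than carried across.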