2008/03/07

Kerio 6.5.0 update/upgrade - WAIT

Wow; a 15-hour day yesterday, and I'm up at 4am still thinking about our failed attempt to upgrade from Kerio 6.4.2 to 6.5.0.

I woke up early, though, because I think I have a solution.

HOWEVER, this won't be an option for some shops, so if you're considering 6.5.0, please consider carefully, and have a rollback* plan.

There seems to be a bug in the new 6.5.0 code (my own conjecture, shared by the two Kerio Support people I spoke with yesterday).

For us, the problem showed up as soon as everyone arrived and started logging in: response for users ranged from terribly slow to simply unable to log in. Users on the web client were able, after several minutes, to log in and work almost normally. Users on Entourage (the vast majority) were frequently unable to log in at all, and when they could, it took, I kid you not, hours to sync - and the sync rarely completed before an error message appeared. (The error messages varied, and included some I hadn't seen before, to the effect that if you selected one of the choices, you'd be wiping mail out.)

A very painful day all around.

We went back and forth with Kerio Support, and nothing we tried noticeably improved the situation.

However, after catching almost 5 whopping hours of sleep, I think I may have the answer.

First, a few words about our config that are probably significant: Kerio is configured with accounts "imported" from an Open Directory server (LDAP on Mac OS X Server). Neither the OD server nor the Kerio server (also on Mac OS X Server) broke a sweat during the festivities yesterday (CPU idle on both was in the 80s and 90s, percent-wise, and there was plenty of free RAM; network bandwidth was fine too), but something was bogging down Kerio's queries of the OD server. Actually, it appears that the queries and responses themselves were fine and, specifically, quick; Kerio seemed to be getting bogged down somewhere in its own processing around the queries. For most of the day it also reported, almost every minute, that it had reached its limit of simultaneous LDAP queries (32). Increasing that limit to 64 didn't help, and neither did 128; something else was going on.

There does seem to be a correlation between the problem and lots of Entourage users; however, I now think Entourage is a red herring. The issue that everyone has missed so far is spam - specifically, spam addressed to non-existent addresses. In that case, the message goes through to Kerio, and Kerio hammers OD with (I believe) buggy code and gets bogged down for every operation, including the Entourage sessions, which end up as collateral damage. (Not that Entourage is totally innocent. :)

Here's my theory for a fix: We have a Barracuda and can configure it to talk to OD directly. It's a standard anti-spam feature and it will hammer OD for exactly the same queries, only directly. Kerio will not only not bog, it'll be *faster* than it was before, since it'll never see any of that spam.
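To make the idea concrete, here is a rough Python sketch of recipient verification at the gateway. It is not how the Barracuda actually implements it; the in-memory set is a stand-in for the OD/LDAP lookup, and the function names and example addresses are made up for illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple


def build_directory(addresses: Iterable[str]) -> FrozenSet[str]:
    """Normalize the known-good addresses (hypothetical stand-in for the OD lookup)."""
    return frozenset(a.strip().lower() for a in addresses)


def filter_recipients(
    recipients: Iterable[str], directory: FrozenSet[str]
) -> Tuple[List[str], List[str]]:
    """Split recipients into (accepted, rejected) before anything reaches the mail server."""
    accepted: List[str] = []
    rejected: List[str] = []
    for rcpt in recipients:
        # Compare case-insensitively; unknown addresses are refused at the gateway.
        if rcpt.strip().lower() in directory:
            accepted.append(rcpt)
        else:
            rejected.append(rcpt)
    return accepted, rejected


if __name__ == "__main__":
    directory = build_directory(["alice@example.com", "Bob@Example.com"])
    accepted, rejected = filter_recipients(
        ["ALICE@example.com", "nobody@example.com"], directory
    )
    print(accepted)  # ['ALICE@example.com']
    print(rejected)  # ['nobody@example.com']
```

The point is only the ordering: the lookup happens at the gateway, so mail for non-existent addresses is refused there and the mail server never has to query the directory about it.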

So far just a theory...

(*Speaking of rollback: Kerio support would only give me "no guarantees" and "your plan sounds good" (grumble); however, it appears to be as simple as this: shut the server down; run the upgrade installer to UNinstall; restore your mailserver.cfg file from before the upgrade; replace the prior mailserver.cfg file (BEFORE re-running the upgrade); re-run the upgrade - which auto-starts the server; check, test, confirm...)
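The rollback only works if you kept a copy of mailserver.cfg from before the upgrade. A minimal Python sketch of that one step follows, assuming you know where your install keeps the file; the helper names and the paths in the demo are hypothetical (the demo runs in a temp directory, not against a real Kerio install), and it does not touch the installer or the server itself.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_config(cfg_path, backup_dir):
    """Copy the config file into backup_dir with a timestamp suffix; return the copy's path."""
    cfg = Path(cfg_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{cfg.name}.{stamp}"
    shutil.copy2(cfg, dest)  # copy2 also preserves modification times
    return dest


def restore_config(backup_path, cfg_path):
    """Put a saved copy back in place (do this with the server shut down)."""
    shutil.copy2(backup_path, cfg_path)


if __name__ == "__main__":
    # Demo in a throwaway directory; substitute your real config location.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "mailserver.cfg"
        cfg.write_text("settings before upgrade")
        saved = backup_config(cfg, Path(tmp) / "backups")
        cfg.write_text("settings after upgrade")
        restore_config(saved, cfg)
        print(cfg.read_text())  # settings before upgrade
```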
