Something ugly happened to my LDAP database a while back, and I never noticed. I saw it had lost a bunch of records, but I’d put it down to some replication problem and never investigated. It wasn’t until I tried to replace one of the lost records, and got an error from LDAP telling me the non-existent record already existed, that I figured something was really wrong.
Multiple iterations of db_recover, attempts to re-index, dump-and-restores of the raw Berkeley DB files… Nothing helped. In the end, all that was left was the slapcat-delete-slapadd dance.
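For anyone wondering what that flailing looked like, it went roughly along these lines (a reconstruction rather than an exact transcript, and assuming the stock /var/lib/openldap-data location):

# /etc/init.d/slapd stop
# db_recover -v -h /var/lib/openldap-data
# slapindex
# cd /var/lib/openldap-data
# db_dump -f /tmp/id2entry.dump id2entry.bdb
# db_load -f /tmp/id2entry.dump id2entry.new && mv id2entry.new id2entry.bdb
# chown ldap:ldap /var/lib/openldap-data/*
# /etc/init.d/slapd start

None of which, as I said, made the slightest difference.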
(You know that your OpenLDAP is especially sick when commands like slapcat generate glibc backtraces. 😦 )
So with what was left of my LDAP data, I started comparing against my replicated LDAP server. The first thing I noticed was that a number of records I expected to have been replicated were not there. I figured that records in the master directory lost to database corruption, rather than to an LDAP operation (a modify or delete), should still have been present on the replicated copy. They weren't, which makes me think replication only takes effect after the master directory's backend is updated; if something like a corrupted database prevents the master from being updated, the replication never happens. As Zaphod might say, ten points for directory consistency but minus several million for data preservation… 🙂
(The more I think about this, though, the less sense it makes. If slapd had been unable to update the backend, and hence the replication didn't take place, surely that would have been returned to me as an update error? I know for a fact that the data I lost made it to the database, because I tested an app against it. It seems unreasonable that BDB would return success on a write operation it hadn't actually completed, though I suppose write-caching might create an opportunity for that to occur… No, I suspect a different problem, maybe just replication being suspended at the time, as the real reason some data was missing from the replica.)
Next I found that, despite what I thought was happening based on the lost records, quite a few of the missing records were on the replica after all. This makes me think I've had multiple failures, apparently at different times, impairing my master directory: one that caused new updates to be lost, and another that destroyed existing data.
I've added a step to my Bacula processing that performs a slapcat and backs up the resulting LDIF, so if anything like this happens in the future I have at least a chance of digging through old files and restoring; a rough sketch of that step is below. The other thing I'll kick off is a process to verify the integrity of the replica, which might tip me off to a problem sooner rather than later.
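For what it's worth, the backup step is nothing fancier than a small script along these lines, hooked in ahead of the Bacula job (via a ClientRunBeforeJob directive, say); the paths and retention period here are illustrative, so adjust to taste:

#!/bin/sh
# Dump the whole directory to a dated LDIF so the regular Bacula
# filesystem backup sweeps it up along with everything else.
set -e
BACKUP_DIR=/var/backups/ldap
mkdir -p "$BACKUP_DIR"
slapcat > "$BACKUP_DIR/directory-$(date +%Y%m%d).ldif"
# Prune exports older than a month so the backup set doesn't balloon.
find "$BACKUP_DIR" -name 'directory-*.ldif' -mtime +31 -delete

The replica check could be as crude as comparing entry counts (slapcat | grep -c '^dn:' on each box) and complaining if they differ; even that would have flagged this mess much earlier.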
My theory on what caused this hassle? Well, a while ago I was having a bit of trouble with partitions filling up. At a guess, I'd say OpenLDAP was trying to do something (update a transaction log, maybe) at a time when the partition its data lives on was full, and got twisted. Soon I'm going to write a separate post with my (updated) thoughts about isolation of failure domains…
For those who haven't seen it, here's the process I used to get things back:
# cd /var/lib
# slapcat > whatsleft.ldif
# /etc/init.d/slapd stop
# mv openldap-data openldap-data-old
# mkdir openldap-data
# chown ldap:ldap openldap-data
# cp -a openldap-data-old/DB_CONFIG openldap-data/
# cd openldap-data
# slapadd < ../whatsleft.ldif
# chown ldap:ldap *
# /etc/init.d/slapd start
Obviously if you find yourself in the unfortunate position of having to use this process, substitute your distribution’s values for the path to the OpenLDAP data directory and the user/group that LDAP runs under.
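One extra sanity check worth doing once slapd is back up: confirm the rebuilt database holds the same number of entries as the LDIF you fed it. Something as simple as this will do:

# grep -c '^dn:' /var/lib/whatsleft.ldif
# slapcat | grep -c '^dn:'

If the two numbers don't match, stop and work out why before letting clients loose on the directory again.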