First, I can only fix problems which are reproducible. Can we arrange another power outage at PSI?
But seriously, I guess what happened is that elog sees an empty directory when the AFS server
goes down. If this happens, it rebuilds its internal (RAM-based) index and sees no entries there.
So the next entry will be ID 1. That should be independent of the ELOG version. I guess if PSI
had had a power outage a year ago, you could have had the same problem.
I had a problem a long time ago with AFS, where the network access blocked the program
for several minutes. I decided then to ONLY use local filesystems for elog servers, and do the
backup via rsync to an AFS account. Since then I never had problems.
Now it is hard for me to develop code which avoids the mentioned problem. I could maybe
check whether there were many entries and all of a sudden there are none any more, and in that
case have the server stop with a detailed error message. But it is hard for me to mimic the AFS server
outage. I can try to manually delete elog files and see what happens, but this only partially
mimics network problems.
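
A minimal sketch of such a check, in C. This is only an illustration of the idea, not existing elogd code: the names index_guard, check_index_rebuild and the min_trusted threshold are hypothetical, and the caller is assumed to pass in the number of entries found by each index rebuild.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-logbook guard state; not part of elogd. */
typedef struct {
    int last_count;   /* entries found by the last trusted index rebuild */
    int min_trusted;  /* only guard logbooks that had at least this many entries */
} index_guard;

typedef enum { INDEX_OK, INDEX_SUSPECT } index_status;

/* Compare the entry count of a fresh index rebuild with the previous one.
   A drop from "many" entries to zero is treated as suspicious, e.g. the
   logbook directory may be temporarily unreachable. */
index_status check_index_rebuild(index_guard *g, int new_count)
{
    assert(g != NULL);

    if (g->last_count >= g->min_trusted && new_count == 0) {
        fprintf(stderr,
                "Index rebuild found 0 entries, previously %d; "
                "logbook directory may be unreachable\n",
                g->last_count);
        /* Keep last_count unchanged so later rebuilds are still checked. */
        return INDEX_SUSPECT;
    }

    g->last_count = new_count;
    return INDEX_OK;
}
```

On INDEX_SUSPECT the caller could refuse new entries or stop the server instead of handing out ID 1. A logbook that really is emptied on purpose would also trigger the check, so the threshold and the reaction would need some thought.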
/Stefan
> I've been running ELOG for several years with rather heavy usage.
> Last week I upgraded from 3.0.0 to 3.1.0, and this week I had the same problem twice:
> elogd lost the index for old entries and showed empty logbooks, without having restarted.
> The logbooks appeared to be empty; new entries started with index "1".
> The first time the origin of the problem was network trouble;
> the second time it had been caused by a severe problem of our AFS file system service.
> I never experienced this consequence for ELOG in the past when we had AFS problems.
>
> Since the logbooks are used for the operation log of a user facility, they continued to make new entries.
> The next day I had to re-number the new entries and restart elogd and everything was fine.
>
> I could understand if elogd crashes when the filesystem of the logbook goes away.
> And when it restarts with a (temporarily) empty filesystem, that would explain what happened.
> But it did not restart and the log file does not contain any information about any problem,
> just that suddenly all new entries in each logbook started with ID "1" again.
>
> Stefan, any idea?
> Anyone else ever experienced that with the new ELOG version (or older ones)?