This page documents common problems with OpenAFS and how to fix them. It covers misconfiguration and similar operational problems, not bugs in OpenAFS itself.
First, a few things to check before trying to debug virtually any error:
- Make sure the clocks are synced with NTP.
- Make sure the KeyFile is the same on each database and file server.
- Make sure that no line in /etc/hosts maps the machine's real IP address to localhost or to its short (non-FQDN) name.
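
The /etc/hosts check can be scripted. The sketch below takes one reading of that item: it flags a line for the given IP address if any name on it is "localhost", or if the first name listed has no dot in it. The function name check_hosts and the address in the usage note are made up for illustration.

```shell
# check_hosts FILE IP
# Print every line of FILE that maps IP to "localhost", or whose first
# (canonical) name is a short name with no dot in it.
# Assumption: this is one interpretation of the checklist item above.
check_hosts() {
    awk -v ip="$2" '
        { sub(/#.*/, "") }                  # strip comments
        $1 == ip && NF >= 2 {
            bad = ($2 !~ /\./)              # canonical name is not an FQDN
            for (i = 2; i <= NF; i++)
                if ($i == "localhost") bad = 1
            if (bad) print
        }
    ' "$1"
}
```

For example, "check_hosts /etc/hosts 192.0.2.10" prints nothing when the entries for that address look sane.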
"No quorum elected"
This happens when you try to change a database that uses the Ubik protocol, for example by creating a new user in the Protection Database. The underlying problem is that the database servers have not agreed on which server is the "sync site". A couple of things can cause this.
First and foremost, make sure that the clocks on all of the database servers are synchronized. If they are even a few minutes apart, it will throw off the entire Ubik protocol (and other things, such as Kerberos). Use NTP to keep the clocks synchronized to ntp1.tjhsst.edu and ntp2.tjhsst.edu.
Loss of network connectivity between the database servers can also, in theory, cause this. Make sure that the servers can communicate with each other. (They should not be connected to each other through NAT or anything similar.)
If none of the above helps, you can examine the situation by running udebug against the database server with the lowest-numbered IP address. Run
udebug -server lowestserver -port 7002
It should say something like
I am sync site until X secs from now
somewhere in its output, indicating that this server is the sync site. It also displays information about the other database servers, including when they last cast a vote.
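
To see at a glance which server, if any, currently claims the sync site role, the same udebug call can be wrapped in a loop. This is a minimal sketch: the function name find_sync_site and the server names in the usage note are placeholders, and it assumes the sync site's output contains the text "I am sync site" as in the example above.

```shell
# find_sync_site PORT SERVER...
# Query each database server with udebug and print the name of every
# server whose output claims the sync site role.
find_sync_site() {
    port=$1
    shift
    for server in "$@"; do
        # A server that is not the sync site reports "I am not sync site",
        # which this pattern does not match.
        if udebug -server "$server" -port "$port" 2>/dev/null |
            grep -q 'I am sync site'; then
            echo "$server"
        fi
    done
}
```

For example, "find_sync_site 7002 db1 db2 db3" checks the protection servers. Use port 7003 to check the vlservers instead. If nothing is printed, no server has been elected.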
Aklog takes a long time
If the "aklog" command sometimes takes a long time, try running it with the "-d" switch to see where it hangs. If it is hanging on the message
About to resolve name adeason to id in cell csl.tjhsst.edu.
then one of the database servers is not responding promptly. Until that is fixed, you can run aklog with the "-noprdb" switch to skip this lookup.
Check every database server and make sure its ptserver and vlserver processes are running properly by running "bos status" against each one.
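
With more than a couple of database servers, the per-server check is easier as a loop. A minimal sketch follows. The function name check_bos and the server names in the usage note are placeholders.

```shell
# check_bos SERVER...
# Run "bos status" against each server, with a header line so the
# output for different servers is easy to tell apart.
check_bos() {
    for server in "$@"; do
        echo "== $server =="
        bos status "$server"
    done
}
```

For example, "check_bos db1 db2 db3". Look for any ptserver or vlserver instance that is not reported as running normally.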