Migrating Sun Grid Engine qmaster from lx24-x86 to lx24-amd64
We moved our Sun Grid Engine qmaster (queue master) from an old server to a new virtual machine. The old server was old enough to have a different architecture from the VM (x86 rather than amd64), so the move needed a little extra tweaking.
Essentially, when the SGE qmaster was started on the new host, it would immediately quit.
Prefixing the qmaster command with strace -f showed that there was a problem reading the database. The root of the problem was traced to the Berkeley DB that the qmaster uses to store information. This is stored in a native format, and the architectures of the old server (x86) and the new VM (amd64) differ.
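A minimal sketch of that diagnosis follows. The binary path and the trace file name are assumptions, not taken from our setup, and calls_touching is a hypothetical helper that only filters a saved trace for system calls mentioning a given path, such as the spool database directory.

```shell
#!/bin/sh
# Sketch only: the binary path and trace file name are assumptions.
QMASTER_BIN=${QMASTER_BIN:-/opt/sge/bin/lx24-amd64/sge_qmaster}
TRACE_FILE=${TRACE_FILE:-/tmp/qmaster.trace}

# Run the qmaster under strace, following child processes (-f) and
# writing the trace to a file (-o) instead of the terminal.
trace_qmaster() {
    strace -f -o "$TRACE_FILE" "$QMASTER_BIN"
}

# Print the traced system calls that mention a literal path fragment.
# $1 = trace file, $2 = path fragment to look for
calls_touching() {
    grep -F -- "$2" "$1"
}

# Example (not run here):
#   trace_qmaster
#   calls_touching "$TRACE_FILE" /opt/sge/uoa-dos/spool/spooldb
```

Writing the trace to a file rather than the terminal makes it easier to search once the daemon has exited.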
The first step, to test that this was indeed the problem, was to hobble the architecture detection and see if the lx24-x86 qmaster binary would start on the new host and read the database. It did.
From x86 to amd64
Since the old server was the only x86 node on the network, there wasn't really much point in keeping the qmaster running on x86, so we looked for a way to convert the database files. We found it in the db_dump and db_load utilities from db4-util.
On the old qmaster the database was dumped:
cd /opt/sge/uoa-dos/spool/spooldb
db_dump sge > sge.dump
db_dump sge_job > sge_job.dump
The dump files were then copied across to the new host, sge-qmaster.stat, and loaded back into fresh databases:
db_load -f sge.dump sge
db_load -f sge_job.dump sge_job
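The dump and load steps can be wrapped in a pair of small functions. This is only a sketch: the function names are hypothetical, and the list of database files is simply the two named in this post.

```shell
#!/bin/sh
# Sketch only: wraps the db_dump/db_load steps for the two databases
# named in this post. Function names are hypothetical.
DB_FILES="sge sge_job"

# Run on the old (x86) qmaster: write a text dump of each database.
# $1 = spool database directory
dump_spooldb() {
    (
        cd "$1" || exit 1
        for db in $DB_FILES; do
            db_dump "$db" > "$db.dump" || exit 1
        done
    )
}

# Run on the new (amd64) qmaster: rebuild each database from its dump.
# $1 = directory holding the copied .dump files
load_spooldb() {
    (
        cd "$1" || exit 1
        for db in $DB_FILES; do
            db_load -f "$db.dump" "$db" || exit 1
        done
    )
}

# Example (not run here):
#   dump_spooldb /opt/sge/uoa-dos/spool/spooldb   # on the old host
#   load_spooldb /opt/sge/uoa-dos/spool/spooldb   # on the new host
```

Each function runs in a subshell, so the cd does not change the caller's working directory, and a failure on either database stops the loop with a non-zero status.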
Note that in the original directory there was a gigantic mess of files taking roughly 16MB:
total 16M
 12K __db.001
452K __db.002
264K __db.003
 40K __db.004
236K __db.005
 12K __db.006
 11M log.0000000303
176K sge
4.7M sge_job
Only the sge and sge_job files were created by db_load. When the qmaster started for the first time, the missing files were initialised:
total 2.2M
 12K __db.001
500K __db.002
228K __db.003
 36K __db.004
380K __db.005
 12K __db.006
852K log.0000000001
120K sge
 76K sge_job
qstat on the new qmaster then showed the correct state.
Changing qmaster on the exec hosts
A puppet file to change the act_qmaster was created. It was only a brief file. A more sophisticated one would softstop the daemon and then start the daemon with the changed settings.
file {'act_qmaster':
      path    => '/opt/sge/uoa-dos/common/act_qmaster',
      ensure  => present,
      owner   => "sgeadmin",
      mode    => '0644',
      content => "sge-qmaster.stat.auckland.ac.nz",
}
Procedure for updating each exec host:
- sge_execd softstop (this leaves jobs running)
- puppet apply
- sge_execd start
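The three steps above can be sketched as a single function to run on each exec host. Both defaults are assumptions: EXECD_CTL stands in for whatever script or wrapper stops and starts the execd on your hosts (here guessed to be the sgeexecd script in the cell's common directory), and act_qmaster.pp is a hypothetical name for the manifest shown earlier.

```shell
#!/bin/sh
# Sketch only: both defaults below are assumptions, not our actual setup.
EXECD_CTL=${EXECD_CTL:-/opt/sge/uoa-dos/common/sgeexecd}  # execd control script
MANIFEST=${MANIFEST:-act_qmaster.pp}                      # hypothetical manifest name

# Point this exec host at the new qmaster without killing running jobs.
switch_qmaster() {
    "$EXECD_CTL" softstop || return 1      # stop execd, leave jobs running
    puppet apply "$MANIFEST" || return 1   # rewrite act_qmaster
    "$EXECD_CTL" start                     # start execd with the new setting
}

# Example (not run here):
#   switch_qmaster
```

Stopping on the first failure means the execd is not restarted if puppet could not rewrite act_qmaster, so a half-applied host is easy to spot.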
As this was applied to each host, the host would disappear from the list maintained by the old qmaster and appear on the list maintained by sge-qmaster.stat. Our submit host was changed first, so that users of that system would read from the new qmaster and submit jobs to the right place.
Once satisfied, we stopped the qmaster daemon on the old server and could prepare that system for turning off.
That worked - or did it?
Once SGE had been switched over to the new qmaster (although not using the recommended method), the jobs continued to run.
There were some small wrinkles: not everything was reflected accurately in the accounting that the qmaster was keeping. Some jobs complained, and some were never culled from the list, remaining as phantom "running" jobs for days afterwards because they expected to report to the old qmaster. That took a bit of cleaning up. The PE accounting was also incorrect, leaving SGE thinking more slots were occupied than there really were.
Restarting the execd on each host helped. Not long after this a grid-wide upgrade was pushed through, in which each host was restarted. That fixed the PE accounting problem.
Summary
Urgh.