You are here: Foswiki>Main Web>EinsteinAtHomeDBSwitch (13 Jul 2018, OliverBock)Edit Attach

Current servers

  • UWM
    • einstein-db2: existing production database server
    • einstein-db3: replica used to create mysql dumps
    • einstein-db4: replica used for webpages
  • AEI
    • einstein-db: replica

After switch servers

  • UWM
    • einstein-db2: will not be running a database unless something goes wrong.
    • einstein-db3: replica used to drive webpages
    • einstein-db4: new production database server
  • AEI
    • einstein-db: replica

Verify

  • openvpn
  • iptables
  • nagios-nrpe
  • other configs
  • packages
  • mysql permissions

Files Needed

  • einstein.uwm
    • Donemy.cnf.db2
    • Donemy.cnf.db4
    • Doneconfig.xml.db2
    • Doneconfig.xml.db4 (change replica and main)
    • Doneconfig_readonly.db4
    • Doneconfig_readonly.db3
    • Doneconfig_replica.db4
    • Doneconfig_replica.db3

  • einstein-db4.uwm
    • Done/root/add_eah_user.sh
    • Done /etc/my.cnf.production

Switch Process

  • Done Thursday, 20th
    • Work on detailed plan
    • Agree on timeline and plan details
  • Done Friday, 21st
    • Assign tasks/responsibilities
    • Finalize plan
    • Sign-off final plan
    • Post news item about planned downtime
  • Done Monday, 24th
    • Final preparations
    • Double-check plan! Any open questions? Is everything in place?
    • Start working on infrastructure
      • Done make sure everyone involved can access all servers (in particular: einstein-db3/4/5)
      • einstein-db4
        • Done Setup openvpn (see einstein-db2: /etc/openvpn)
        • Done Setup iptables (see einstein-db2)
        • Done Setup MySQL config (see einstein-db2: my.cnf)
  • Tuesday, 25th
    • Done Shutdown project (14:00 UTC)
      • Done stop (bin/stop) project Hannover (wait until stoppped!)
      • Done stop (bin/stop) project UWM (einstein.uwm), touch stop_web
      • Done write a note into ~/EinsteinAtHome/html/user/motd.php since news forum won't be available
    • Done Prepare migration
      1. einstein-db2
        1. Done make sure all clients/sessions have stopped (other than slaves)
      2. einstein-db3
        1. Done stop slave
      3. einstein-db4
        1. Done stop slave
        2. Done move master.info and relay-log.info to /raidc/mysql.savedfiles (we do not want it replicate anymore)
      4. einstein-db
        1. Done stop slave
        2. Done change VPN config (db2->db4) and test it (at least ping)
      5. einstein-db2
        1. Done make sure all clients/slaves/sessions have stopped
      6. einstein-wug
        1. Done change VPN config (db2->db4) and test it (at least ping)
    • Done Do migration
      1. einstein-db2
        1. create dump of old master/production DB
          1. Done Flush and lock all tables:
            FLUSH TABLES WITH READ LOCK;
          2. Done Quit the MySQL shell
          3. DoneDump the database:
            time mysqldump --opt --verbose -p EinsteinAtHome --master-data=2 > /raidc/EinsteinAtHome_db4_20110125.dump
          4. DoneTransfer the database to Hannover:
            globus-url-copy -tcp-bs 1048576 -vb -p 512 file:/raidc/EinsteinAtHome_db2_20110125.dump gsiftp://einsteinhsm@atlas1.atlas.aei.uni-hannover.de/home/einsteinhsm/
          5. Donerun sha1sum on dump and note down hash value (einstein-db.aei, db3,db4,db2 all verified .sha1 file
            9cad03d1661a0beb24915da1a6706efde3dede55
        2. Closed note privileges of user eah
          SHOW GRANTS
        3. Closed lock out eah user (don't delete it!)
          REVOKE ALL PRIVILEGES, GRANT OPTION FROM eah
        4. Done unplug network cable
      2. einstein-db4
        1. Done Drop database EinsteinAtHome and recreate it
        2. Done Create e dedicated user account for replication
          1. Log in to MySQL as root
          2. Repeat for all slaves: using a fixed IP (slave's VPN endpoint!):
            GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<SLAVE-IP>' IDENTIFIED BY '<REPLICATOR-PASSWORD>';
        3. Done restore EinsteinAtHome schema from dump
          mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
        4. Done add eah user
        5. Done grant privileges according to what was set up on einstein-db2 (noted down earlier) -> Example:
          GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, SUPER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE ON *.* TO 'eah'@'%'
          /root/eah_user_add.sh
        6. Done Update settings:
          FLUSH PRIVILEGES;
        7. Done Save "show master status"
          mysql -e "show master status;" > /raidc/EinsteinAtHome_db2_20110125.dump.db4_master_status
          [root@einstein-db4 ~]# mysql -e "show master status;"
          +-------------------+-----------+----------------+------------------+
          | File              | Position  | Binlog_Do_DB   | Binlog_Ignore_DB |
          +-------------------+-----------+----------------+------------------+
          | binary_log.000143 | 357650341 | EinsteinAtHome |                  | 
          +-------------------+-----------+----------------+------------------+
          
        8. Done Determine table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger
          CHECKSUM TABLE user EXTENDED;
          • Table user: 3784248569
          • Table host: 2593060393
          • Table workunit: 351601422
          • Table result: 1212766670
          • Table file: 2008314790
          • Table fileset: 465860950
          • Table fileset_file: 4097335178
          • Table sched_trigger: 1852664575
        9. Done Check / restore / set up indexes
        10. Done Analyze tables user, host, workunit, result, file, fileset, fileset_file and sched_trigger
          ANALYZE TABLE user;
      3. einstein-db3 (in parallel to einstein-db4!)
        1. Done Drop database EinsteinAtHome and recreate it
        2. Done Restore EinsteinAtHome schema from dump
          mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
        3. Done Verify table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger (see above)
      4. einstein-db
        1. Done Create dump (in parallel to einstein-db2!)
    • Done Post-processing
      • einstein-db3 (set up replica)
        1. Done Check my.cnf and make sure new master is set
        2. Done Change master log file and position (based on show master status above)
          CHANGE MASTER TO MASTER_HOST='192.168.10.14', MASTER_USER='replicator', MASTER_PASSWORD='<REPLICATOR-PASSWORD>', MASTER_PORT=3306, MASTER_LOG_FILE='binary_log.000143', MASTER_LOG_POS=357650341;
        3. Done Start slave thread
      • einstein-db4
        1. Done configure/check nagios nrpe
  • Wednesday, 26th
    • Post-processing
      • einstein-db (set up replica)
        1. Done Verify sha1 hash of dump available in Hannover (see original hash value above)
        2. Done Drop database EinsteinAtHome and recreate it
        3. Done Restore EinsteinAtHome schema from dump
          mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
        4. Done Verify table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger (see above)
          • DONE user, host, file, fileset, fileset_file and sched_trigger
          • choice-no workunit, result differ!!! (note: these are the only one that contain TIMESTAMP columns)
          • ALERT! Tried a second restore, same result/checksums (workunit 988066414, result 2946951805) as before!
            • This means that there was no import issue
            • This means that this was not caused by a running slave thread (this was the case during/after the first restore)
            • Note: there was no 5.0 release mentioning an "Incompatible Change" since 5.0.51, meaning the TIMESTAMP columns shouldn't be responsible for this
            • Both affected tables contain the same number of records as the ones on db4 and db3
          • TIP Since db3 isn't really used/required right now:
            • Drop the current DB and restore it from the original dump to verify the original checksums again
            • It could be that the original checksums don't reflect the dump as they were taken after db3 was attached to db4. However, this should not change db4's tables in any way, and both server's checksums agreed!
            • In the meantime, let's start the slave thread as nothing happened...
            • We should also verify that the different versions (5.0.51a@AEI vs 5.0.83@UWM) aren't the root cause of this problem
        5. Done Change my.cnf and remove all master* entries that will be set by CHANGE MASTER TO below
        6. Done Change master log file and position (based on show master status above)
          CHANGE MASTER TO MASTER_HOST='192.168.10.14', MASTER_USER='replicator', MASTER_PASSWORD='<REPLICATOR-PASSWORD>', MASTER_PORT=3306, MASTER_LOG_FILE='binary_log.000143', MASTER_LOG_POS=357650341;
        7. Done Start slave thread
    • start up project (10:00 UTC)
      1. verify that ALL clients (e.g. config.xml) use the new password for user eah
      2. einstein.uwm.edu
        1. Done Change BOINC configs (xml)
        2. Done Change /etc/my.cnf
        3. Done Start project
        4. Done immediately disable scheduler
      3. einstein-wug.aei.uni-hannover.de
        1. Done Change BOINC configs (xml)
        2. Done Start project
        3. Done reconfigure nagios-nrpe (db2 -> db4)
      4. einstein-db
        1. Done reconfigure nagios-nrpe (db2 -> db4)
      5. einstein.uwm.edu
        1. Done edit config.xml for locality work <= 20%
        2. Done enable scheduler when enough unsent BRP3 tasks
        3. Done remove note from ~/EinsteinAtHome/html/user/motd.php

  • Thursday, 27th
    • Monitor project (I/O and CPU load, log files, query execution times, locks)
    • Fix issues
      • Replica table checksum issue:
        • einstein-db4
          +------------------------------+------------+
          | Table                        | Checksum   |
          +------------------------------+------------+
          | EinsteinAtHome.user          |  512451807 | 
          | EinsteinAtHome.host          | 4278721571 | 
          | EinsteinAtHome.workunit      | 3297539744 | 
          | EinsteinAtHome.result        | 1511958947 | 
          | EinsteinAtHome.file          | 2008314790 | 
          | EinsteinAtHome.fileset       |  465860950 | 
          | EinsteinAtHome.fileset_file  | 4097335178 | 
          | EinsteinAtHome.sched_trigger | 2174727538 | 
          +------------------------------+------------+
          8 rows in set (57 min 46.51 sec)
          
        • einstein-db3
          +------------------------------+------------+
          | Table                        | Checksum   |
          +------------------------------+------------+
          | EinsteinAtHome.user          |  512451807 | 
          | EinsteinAtHome.host          | 4278721571 | 
          | EinsteinAtHome.workunit      | 3297539744 | 
          | EinsteinAtHome.result        | 1511958947 | 
          | EinsteinAtHome.file          | 2008314790 | 
          | EinsteinAtHome.fileset       |  465860950 | 
          | EinsteinAtHome.fileset_file  | 4097335178 | 
          | EinsteinAtHome.sched_trigger | 2174727538 | 
          +------------------------------+------------+
          8 rows in set (57 min 1.42 sec)
          
        • einstein-db
          +------------------------------+------------+
          | Table                        | Checksum   |
          +------------------------------+------------+
          | EinsteinAtHome.user          |  512451807 | 
          | EinsteinAtHome.host          | 4278721571 | 
          | EinsteinAtHome.workunit      | 2402116790 | 
          | EinsteinAtHome.result        | 2284962296 | 
          | EinsteinAtHome.file          | 2008314790 | 
          | EinsteinAtHome.fileset       |  465860950 | 
          | EinsteinAtHome.fileset_file  | 4097335178 | 
          | EinsteinAtHome.sched_trigger | 2174727538 | 
          +------------------------------+------------+
          8 rows in set (10 min 20.81 sec)
          
          
          MySQL 5.0.51a (different database for restore)
          +----------------------------+------------+|
           Table                      | Checksum   |
          +----------------------------+------------+
          | restore_test.user          | 3784248569 |
          | restore_test.host          | 2593060393 |
          | restore_test.workunit      |  988066414 |
          | restore_test.result        | 2946951805 |
          | restore_test.file          | 2008314790 | 
          | restore_test.fileset       |  465860950 | 
          | restore_test.fileset_file  | 4097335178 | 
          | restore_test.sched_trigger | 1852664575 | 
          +----------------------------+------------+
          
          
          MySQL 5.0.83
          +----------------------------+------------+
          | Table                      | Checksum   |
          +----------------------------+------------+
          | restore_5083.user          | 3784248569 | 
          | restore_5083.host          | 2593060393 | 
          | restore_5083.workunit      |  351601422 | 
          | restore_5083.result        | 1212766670 | 
          | restore_5083.file          | 2008314790 | 
          | restore_5083.fileset       |  465860950 | 
          | restore_5083.fileset_file  | 4097335178 | 
          | restore_5083.sched_trigger | 1852664575 | 
          +----------------------------+------------+
          
        • Conclusion: again, the AEI replica tables workunit and result differ!
        • Tried to verify that the TIMESTAMP columns in are not responsible for this by using the following test table and running an INSERT followed by an UPDATE, generating the table checksum each time. There were no differences across all DB instances! This doesn't, however, exclude that during heavy load the automatic timestamps might produce different values (TIMESTAMP resolution: 1 second), say when replication is delayed by more than a second!
          CREATE TABLE `tstest` (
            `id` int(11) NOT NULL auto_increment,
            `name` varchar(254) NOT NULL default '',
            `mod_time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
            PRIMARY KEY  (`id`)
          ) ENGINE=InnoDB AUTO_INCREMENT=92022422 DEFAULT CHARSET=latin1;
          
  • Friday, 28th
    • Monitor project (I/O and CPU load, log files, query execution times, locks)
    • Make sure the project survives the weekend

Failure Plan

What do we do if einstein-db4 crashes? Right now I think we should switch to einstein-db3 unless we know it will have the same problem. A detailed plan should be added here.

-- DavidHammer - 21 Sep 2010

I don't think that we should switch to yet another server right away. This decision should depend on what happened and when:
  • If we can't get db4 (the new master) up and running (project still offline) we should back out by putting db2 (the old master) into place again
  • If db4 crashes (what does this mean?) after the project was started up we should
    1. stop the project if possible
    2. analyse what's going on and then
      • fix the problem if possible
      • or use the replica on db3 if possible (this is only the case if the master DB itself is consistent (hence the failure is hardware-related) and the replica on db3 is up-to-date!)

-- OliverBehnke - 20 Jan 2011

Topic revision: r113 - 13 Jul 2018, OliverBock
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback