Current servers
- UWM
  - einstein-db2: existing production database server
  - einstein-db3: replica used to create MySQL dumps
  - einstein-db4: replica used for webpages
- AEI
Servers after the switch
- UWM
  - einstein-db2: will not be running a database unless something goes wrong
  - einstein-db3: replica used to drive webpages
  - einstein-db4: new production database server
- AEI
Verify
- openvpn
- iptables
- nagios-nrpe
- other configs
- packages
- mysql permissions
Files Needed
- einstein.uwm
  - my.cnf.db2
  - my.cnf.db4
  - config.xml.db2
  - config.xml.db4 (change replica and main)
  - config_readonly.db4
  - config_readonly.db3
  - config_replica.db4
  - config_replica.db3
- einstein-db4.uwm
  - /root/add_eah_user.sh
  - /etc/my.cnf.production
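A pre-flight check along these lines could confirm the staged files are in place before the switch. This is a sketch: the staging directory and the subset of file names below are illustrative, not the real layout on the hosts.

```shell
# Sketch: verify that staged config files exist before the switch.
# BASE and the file list are demo values; adjust per host.
BASE="${BASE:-/tmp/switch_staging_demo}"
mkdir -p "$BASE"
# For the demo, stage only two of the files so the check reports the rest.
touch "$BASE/my.cnf.db2" "$BASE/my.cnf.db4"

missing=0
for f in my.cnf.db2 my.cnf.db4 config.xml.db2 config.xml.db4; do
    if [ ! -e "$BASE/$f" ]; then
        echo "MISSING: $f"
        missing=$((missing + 1))
    fi
done
echo "missing files: $missing"
# demo output:
#   MISSING: config.xml.db2
#   MISSING: config.xml.db4
#   missing files: 2
```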
Switch Process
- Thursday, 20th
  - Work on detailed plan
  - Agree on timeline and plan details
- Friday, 21st
  - Assign tasks/responsibilities
  - Finalize plan
  - Sign off on the final plan
  - Post news item about planned downtime
- Monday, 24th
  - Final preparations
  - Double-check plan! Any open questions? Is everything in place?
  - Start working on infrastructure
    - Make sure everyone involved can access all servers (in particular: einstein-db3/4/5)
  - einstein-db4
    - Set up openvpn (see einstein-db2: /etc/openvpn)
    - Set up iptables (see einstein-db2)
    - Set up MySQL config (see einstein-db2: my.cnf)
- Tuesday, 25th
  - Shutdown project (14:00 UTC)
    - Stop project Hannover (bin/stop); wait until stopped!
    - Stop project UWM (einstein.uwm) via bin/stop, then touch stop_web
    - Write a note into ~/EinsteinAtHome/html/user/motd.php since the news forum won't be available
  - Prepare migration
    - einstein-db2
      - Make sure all clients/sessions have stopped (other than slaves)
    - einstein-db3
      - Stop slave
    - einstein-db4
      - Stop slave
      - Move master.info and relay-log.info to /raidc/mysql.savedfiles (we do not want it to replicate anymore)
    - einstein-db
      - Stop slave
      - Change VPN config (db2 -> db4) and test it (at least ping)
    - einstein-db2
      - Make sure all clients/slaves/sessions have stopped
    - einstein-wug
      - Change VPN config (db2 -> db4) and test it (at least ping)
  - Do migration
    - einstein-db2
      - Create dump of old master/production DB
        - Flush and lock all tables: FLUSH TABLES WITH READ LOCK;
        - Quit the MySQL shell
        - Dump the database: time mysqldump --opt --verbose -p EinsteinAtHome --master-data=2 > /raidc/EinsteinAtHome_db2_20110125.dump
        - Transfer the database to Hannover: globus-url-copy -tcp-bs 1048576 -vb -p 512 file:/raidc/EinsteinAtHome_db2_20110125.dump gsiftp://einsteinhsm@atlas1.atlas.aei.uni-hannover.de/home/einsteinhsm/
        - Run sha1sum on the dump and note down the hash value (einstein-db.aei, db3, db4, db2 all verified); .sha1 file: 9cad03d1661a0beb24915da1a6706efde3dede55
      - Note privileges of user eah: SHOW GRANTS
      - Lock out the eah user (don't delete it!): REVOKE ALL PRIVILEGES, GRANT OPTION FROM eah
      - Unplug network cable
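The sha1sum bookkeeping above can be made mechanical with `sha1sum -c`, so each receiving host checks the dump against the recorded .sha1 file instead of comparing hashes by eye. A sketch using a stand-in demo file, not the real dump:

```shell
# Sketch: record and later verify a dump's SHA1. The demo file stands in
# for the real dump (/raidc/EinsteinAtHome_db2_20110125.dump).
# On the source host:
dump=/tmp/demo_dump.sql
printf 'CREATE TABLE t (id INT);\n' > "$dump"
sha1sum "$dump" > "$dump.sha1"    # ship the .sha1 file alongside the dump

# On each receiving host (db3, db4, einstein-db.aei), after the transfer:
sha1sum -c "$dump.sha1" && echo "dump verified OK"
```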
    - einstein-db4
      - Drop database EinsteinAtHome and recreate it
      - Create a dedicated user account for replication
        - Log in to MySQL as root
        - Repeat for all slaves, using a fixed IP (the slave's VPN endpoint!): GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<SLAVE-IP>' IDENTIFIED BY '<REPLICATOR-PASSWORD>';
      - Restore EinsteinAtHome schema from dump: mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
      - Add the eah user
        - Grant privileges according to what was set up on einstein-db2 (noted down earlier). Example: GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, SUPER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE ON *.* TO 'eah'@'%'
        - Run /root/eah_user_add.sh
      - Update settings: FLUSH PRIVILEGES;
      - Save "show master status": mysql -e "show master status;" > /raidc/EinsteinAtHome_db2_20110125.dump.db4_master_status
[root@einstein-db4 ~]# mysql -e "show master status;"
+-------------------+-----------+----------------+------------------+
| File              | Position  | Binlog_Do_DB   | Binlog_Ignore_DB |
+-------------------+-----------+----------------+------------------+
| binary_log.000143 | 357650341 | EinsteinAtHome |                  |
+-------------------+-----------+----------------+------------------+
      - Determine table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger: CHECKSUM TABLE user EXTENDED;
        - Table user: 3784248569
        - Table host: 2593060393
        - Table workunit: 351601422
        - Table result: 1212766670
        - Table file: 2008314790
        - Table fileset: 465860950
        - Table fileset_file: 4097335178
        - Table sched_trigger: 1852664575
      - Check / restore / set up indexes
      - Analyze tables user, host, workunit, result, file, fileset, fileset_file and sched_trigger: ANALYZE TABLE user;
    - einstein-db3 (in parallel to einstein-db4!)
      - Drop database EinsteinAtHome and recreate it
      - Restore EinsteinAtHome schema from dump: mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
      - Verify table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger (see above)
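Verifying the checksums across several hosts by eye is error-prone; the comparison can be scripted. A sketch, in which the here-documents stand in for saved CHECKSUM TABLE output from each host reduced to table/checksum pairs:

```shell
# Sketch: diff saved CHECKSUM TABLE output from two hosts and report only
# tables whose checksums differ. The demo files below stand in for output
# saved per host, e.g. via: mysql -N -e "CHECKSUM TABLE ..." > /tmp/checksums.<host>
cat > /tmp/checksums.db4 <<'EOF'
EinsteinAtHome.user 3784248569
EinsteinAtHome.workunit 351601422
EOF
cat > /tmp/checksums.db3 <<'EOF'
EinsteinAtHome.user 3784248569
EinsteinAtHome.workunit 988066414
EOF
# join on table name (inputs must be sorted on that field), print mismatches
join /tmp/checksums.db4 /tmp/checksums.db3 \
    | awk '$2 != $3 { print "MISMATCH: " $1 " (" $2 " vs " $3 ")" }'
# demo output:
#   MISMATCH: EinsteinAtHome.workunit (351601422 vs 988066414)
```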
    - einstein-db
      - Create dump (in parallel to einstein-db2!)
  - Post-processing
    - einstein-db3 (set up replica)
      - Check my.cnf and make sure the new master is set
      - Change master log file and position (based on "show master status" above): CHANGE MASTER TO MASTER_HOST='192.168.10.14', MASTER_USER='replicator', MASTER_PASSWORD='<REPLICATOR-PASSWORD>', MASTER_PORT=3306, MASTER_LOG_FILE='binary_log.000143', MASTER_LOG_POS=357650341;
      - Start slave thread
    - einstein-db4
      - Configure/check nagios-nrpe
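After the CHANGE MASTER TO, the slave threads' health can be confirmed from the MySQL shell. A minimal check (standard MySQL statements; the listed fields are from the SHOW SLAVE STATUS output):

```sql
START SLAVE;
SHOW SLAVE STATUS\G
-- Expect both replication threads running and the lag shrinking towards 0:
--   Slave_IO_Running:      Yes
--   Slave_SQL_Running:     Yes
--   Seconds_Behind_Master: 0
```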
- Wednesday, 26th
  - Post-processing
    - einstein-db (set up replica)
      - Verify sha1 hash of dump available in Hannover (see original hash value above)
      - Drop database EinsteinAtHome and recreate it
      - Restore EinsteinAtHome schema from dump: mysql -p EinsteinAtHome < /path/to/EinsteinAtHome.dump
      - Verify table checksums of user, host, workunit, result, file, fileset, fileset_file and sched_trigger (see above)
        - user, host, file, fileset, fileset_file and sched_trigger match
        - workunit, result differ!!! (note: these are the only ones that contain TIMESTAMP columns)
          - Tried a second restore; same result/checksums (workunit 988066414, result 2946951805) as before!
            - This means that there was no import issue
            - This means that this was not caused by a running slave thread (that was the case during/after the first restore)
            - Note: there was no 5.0 release mentioning an "Incompatible Change" since 5.0.51, so the TIMESTAMP columns shouldn't be responsible for this
            - Both affected tables contain the same number of records as the ones on db4 and db3
          - Since db3 isn't really used/required right now:
            - Drop the current DB and restore it from the original dump to verify the original checksums again
            - It could be that the original checksums don't reflect the dump, as they were taken after db3 was attached to db4. However, this should not change db4's tables in any way, and both servers' checksums agreed!
            - In the meantime, let's start the slave thread as if nothing happened...
            - We should also verify that the different versions (5.0.51a@AEI vs 5.0.83@UWM) aren't the root cause of this problem
      - Change my.cnf and remove all master* entries that will be set by CHANGE MASTER TO below
      - Change master log file and position (based on "show master status" above): CHANGE MASTER TO MASTER_HOST='192.168.10.14', MASTER_USER='replicator', MASTER_PASSWORD='<REPLICATOR-PASSWORD>', MASTER_PORT=3306, MASTER_LOG_FILE='binary_log.000143', MASTER_LOG_POS=357650341;
      - Start slave thread
  - Start up project (10:00 UTC)
    - Verify that ALL clients (e.g. config.xml) use the new password for user eah
    - einstein.uwm.edu
      - Change BOINC configs (xml)
      - Change /etc/my.cnf
      - Start project
      - Immediately disable scheduler
    - einstein-wug.aei.uni-hannover.de
      - Change BOINC configs (xml)
      - Start project
      - Reconfigure nagios-nrpe (db2 -> db4)
    - einstein-db
      - Reconfigure nagios-nrpe (db2 -> db4)
    - einstein.uwm.edu
      - Edit config.xml for locality work <= 20%
      - Enable scheduler when there are enough unsent BRP3 tasks
      - Remove note from ~/EinsteinAtHome/html/user/motd.php
- Thursday, 27th
  - Monitor project (I/O and CPU load, log files, query execution times, locks)
  - Fix issues
  - Replica table checksum issue:
    - einstein-db4
+------------------------------+------------+
| Table                        | Checksum   |
+------------------------------+------------+
| EinsteinAtHome.user          | 512451807  |
| EinsteinAtHome.host          | 4278721571 |
| EinsteinAtHome.workunit      | 3297539744 |
| EinsteinAtHome.result        | 1511958947 |
| EinsteinAtHome.file          | 2008314790 |
| EinsteinAtHome.fileset       | 465860950  |
| EinsteinAtHome.fileset_file  | 4097335178 |
| EinsteinAtHome.sched_trigger | 2174727538 |
+------------------------------+------------+
8 rows in set (57 min 46.51 sec)
    - einstein-db3
+------------------------------+------------+
| Table                        | Checksum   |
+------------------------------+------------+
| EinsteinAtHome.user          | 512451807  |
| EinsteinAtHome.host          | 4278721571 |
| EinsteinAtHome.workunit      | 3297539744 |
| EinsteinAtHome.result        | 1511958947 |
| EinsteinAtHome.file          | 2008314790 |
| EinsteinAtHome.fileset       | 465860950  |
| EinsteinAtHome.fileset_file  | 4097335178 |
| EinsteinAtHome.sched_trigger | 2174727538 |
+------------------------------+------------+
8 rows in set (57 min 1.42 sec)
    - einstein-db
+------------------------------+------------+
| Table                        | Checksum   |
+------------------------------+------------+
| EinsteinAtHome.user          | 512451807  |
| EinsteinAtHome.host          | 4278721571 |
| EinsteinAtHome.workunit      | 2402116790 |
| EinsteinAtHome.result        | 2284962296 |
| EinsteinAtHome.file          | 2008314790 |
| EinsteinAtHome.fileset       | 465860950  |
| EinsteinAtHome.fileset_file  | 4097335178 |
| EinsteinAtHome.sched_trigger | 2174727538 |
+------------------------------+------------+
8 rows in set (10 min 20.81 sec)
MySQL 5.0.51a (different database for restore)
+----------------------------+------------+
| Table                      | Checksum   |
+----------------------------+------------+
| restore_test.user          | 3784248569 |
| restore_test.host          | 2593060393 |
| restore_test.workunit      | 988066414  |
| restore_test.result        | 2946951805 |
| restore_test.file          | 2008314790 |
| restore_test.fileset       | 465860950  |
| restore_test.fileset_file  | 4097335178 |
| restore_test.sched_trigger | 1852664575 |
+----------------------------+------------+
MySQL 5.0.83
+----------------------------+------------+
| Table                      | Checksum   |
+----------------------------+------------+
| restore_5083.user          | 3784248569 |
| restore_5083.host          | 2593060393 |
| restore_5083.workunit      | 351601422  |
| restore_5083.result        | 1212766670 |
| restore_5083.file          | 2008314790 |
| restore_5083.fileset       | 465860950  |
| restore_5083.fileset_file  | 4097335178 |
| restore_5083.sched_trigger | 1852664575 |
+----------------------------+------------+
  - Conclusion: again, the AEI replica tables workunit and result differ!
  - Tried to verify that the TIMESTAMP columns are not responsible for this by using the following test table and running an INSERT followed by an UPDATE, generating the table checksum each time. There were no differences across all DB instances! This doesn't, however, exclude that under heavy load the automatic timestamps might produce different values (TIMESTAMP resolution: 1 second), say when replication is delayed by more than a second!
CREATE TABLE `tstest` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(254) NOT NULL default '',
`mod_time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=92022422 DEFAULT CHARSET=latin1;
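The INSERT/UPDATE probe described above could look like this (hypothetical probe values; run the same statements on every instance and compare the reported checksums):

```sql
INSERT INTO tstest (name) VALUES ('probe');
CHECKSUM TABLE tstest EXTENDED;
-- trigger the automatic ON UPDATE CURRENT_TIMESTAMP
UPDATE tstest SET name = 'probe-updated' WHERE name = 'probe';
CHECKSUM TABLE tstest EXTENDED;
```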
- Friday, 28th
  - Monitor project (I/O and CPU load, log files, query execution times, locks)
  - Make sure the project survives the weekend
Failure Plan
What do we do if einstein-db4 crashes? Right now I think we should switch to einstein-db3 unless we know it will have the same problem. A detailed plan should be added here.
--
DavidHammer - 21 Sep 2010
I don't think that we should switch to yet another server right away. This decision should depend on what happened and when:
- If we can't get db4 (the new master) up and running (project still offline) we should back out by putting db2 (the old master) into place again
- If db4 crashes (what does this mean?) after the project was started up, we should
  - stop the project if possible
  - analyse what's going on, and then
  - fix the problem if possible
  - or use the replica on db3 if possible (this is only the case if the master DB itself is consistent, i.e. the failure is hardware-related, and the replica on db3 is up to date!)
--
OliverBehnke - 20 Jan 2011