Case study: a MySQL InnoDB Cluster with two nodes stuck in MISSING state. After taking over, I pieced together roughly how it got there: all three cluster nodes had been restarted at the same time, then restarted repeatedly afterwards. The underlying problem was never fixed; instead, the first node was simply used to bootstrap the group. The state at hand-over looked like this (node information has been obfuscated):
MySQL mysql01:3306 ssl JS > cluster.status();
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql02:3306": {
                "address": "mysql02:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            },
            "mysql03:3306": {
                "address": "mysql03:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
MySQL mysql01:3306 ssl JS >
Normally, if everything goes smoothly, all that is needed is to rejoin the nodes to the cluster and let them recover their data.
MySQL mysql01:3306 ssl JS > cluster.rejoinInstance('mysqladm@mysql02:3306')
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
ERROR: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.
mysql02:3306 has the following errant GTIDs that do not exist in the cluster:
0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580
Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to rejoining the instance to the cluster.
Discarding these extra GTID events can either be done manually or by completely overwriting the state of mysql02:3306 with a physical snapshot from an existing cluster member. To achieve this remove the instance from the cluster and add it back using <Cluster>.addInstance() and setting the 'recoveryMethod' option to 'clone'.
ERROR: RuntimeError: The instance 'mysql02:3306' contains errant transactions that did not originate from the cluster.
Cluster.rejoinInstance: The instance 'mysql02:3306' contains errant transactions that did not originate from the cluster. (RuntimeError)
But as shown above, when the InnoDB Cluster originally broke, mysql01 did not hold the latest transactions, yet it was the node used to bootstrap the group. Now that mysql02 tries to rejoin, its data conflicts with the cluster's.
By this point the application had already been running against mysql01 (as a single node) for some time, so mysql01 contains a large amount of new data. The only option left is to remove mysql02/mysql03 from the cluster and then add them back.
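The errant set reported by the shell can be cross-checked with MySQL's GTID functions. A minimal sketch, run on the current primary mysql01 (the UUID and range are taken from the error message above):

```sql
-- GTID_SUBTRACT(set1, set2) returns the GTIDs in set1 that are absent
-- from set2. A non-empty result confirms mysql02 executed transactions
-- the cluster never saw, i.e. its data has genuinely diverged.
SELECT GTID_SUBTRACT(
         '0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580',
         @@GLOBAL.gtid_executed) AS errant_on_mysql02;
```

The same comparison can be run the other way round on mysql02 (its full `@@GLOBAL.gtid_executed` against the primary's) to see everything it has that the cluster lacks.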
MySQL mysql01:3306 ssl JS > cluster.removeInstance('mysqladm@mysql02:3306')
ERROR: mysql02:3306 is reachable but has state OFFLINE
To safely remove it from the cluster, it must be brought back ONLINE. If not possible, use the 'force' option to remove it anyway.
Do you want to continue anyway (only the instance metadata will be removed)? [y/N]: yes
The instance will be removed from the InnoDB Cluster.
NOTE: Transaction sync was skipped
NOTE: The instance 'mysql02:3306' is OFFLINE, Group Replication stop skipped.
The instance 'mysql02:3306' was successfully removed from the cluster.
MySQL mysql01:3306 ssl JS > cluster.status();
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql03:3306": {
                "address": "mysql03:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
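As a side note, the interactive confirmation in the removal above can be skipped: `cluster.removeInstance()` accepts a `force` option for members that cannot be brought back ONLINE. A sketch using this environment's account name:

```js
// Removes the OFFLINE member without the [y/N] prompt. Only the
// metadata entry is deleted; the instance itself is left untouched.
cluster.removeInstance('mysqladm@mysql02:3306', {force: true});
```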
Re-adding mysql02 to the cluster should also have been straightforward. Instead, it hit a major pitfall, as shown below:
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306');
WARNING: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.
mysql02:3306 has the following errant GTIDs that do not exist in the cluster:
0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580
WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of mysql02:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Please select a recovery method [C]lone/[A]bort (default Abort): C
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using 'mysql02:3306'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: mysql02:3306 is being cloned from mysql01:3306
ERROR: The clone process has failed: Error dropping database (can't rmdir './backup/', errno: 17 - File exists) (1010)
ERROR: Error dropping database (can't rmdir './backup/', errno: 17 - File exists)
Cluster.addInstance: Error dropping database (can't rmdir './backup/', errno: 17 - File exists) (RuntimeError)
The clone failed because of this environment's filesystem layout: the MySQL data directory is /data/mysql, but the backup filesystem is mounted at /data/mysql/backup, inside the data directory, and the mount point is owned by root. Why the original deployer chose such a layout is anyone's guess, but the consequence is that joining the node to the InnoDB Cluster blows up: the clone stage treats backup as a database schema and tries to drop it, cannot rmdir the root-owned mount point, and the join aborts with the error above. A big trap, discovered the hard way.
# df -h
Filesystem Size Used Avail Use% Mounted on
...........................................................
/dev/mapper/vg00-mysql 500G 7.3G 493G 2% /data/mysql
/dev/sdb 3.7T 27G 3.7T 1% /data/mysql/backup
............................................................
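Before any clone-based join, it is worth verifying that nothing foreign is mounted beneath the datadir. A minimal sketch, assuming a Linux host with /proc/mounts; /data/mysql is this environment's datadir, so substitute your own:

```shell
DATADIR=/data/mysql

# Print any mount point strictly below the given directory.
# /proc/mounts field 2 is the mount point; a prefix match on "$1/"
# catches things like /data/mysql/backup but not /data/mysql itself
# (the datadir being its own mount is harmless).
mounts_under_datadir() {
    awk -v d="$1/" 'index($2, d) == 1 {print $2}' /proc/mounts
}

if [ -n "$(mounts_under_datadir "$DATADIR")" ]; then
    echo "WARNING: foreign mount(s) inside $DATADIR - clone may fail to drop them:"
    mounts_under_datadir "$DATADIR"
fi
```

Unmounting (or relocating) such a filesystem before running `addInstance()` with clone recovery avoids the `can't rmdir` failure entirely.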
After that failed clone, the node could no longer be added to the InnoDB Cluster at all: the next attempt failed with a metadata error.
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306');
ERROR: MySQL Error 1356: Failed to execute query on Metadata server mysql02:3306: View 'mysql_innodb_cluster_metadata.v2_this_instance' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Cluster.addInstance: Failed to execute query on Metadata server mysql02:3306: View 'mysql_innodb_cluster_metadata.v2_this_instance' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them (MySQL Error 1356)
MySQL mysql01:3306 ssl JS >
At this point, after trying several approaches, I ended up dropping the InnoDB Cluster metadata schema mysql_innodb_cluster_metadata on mysql02, as follows:
mysql> SET GLOBAL super_read_only = OFF;
Query OK, 0 rows affected (0.00 sec)
mysql> drop database mysql_innodb_cluster_metadata;
Query OK, 12 rows affected (0.02 sec)
mysql> exit
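For the record, MySQL Shell's AdminAPI also offers a helper for this cleanup; connected to the broken instance, `dba.dropMetadataSchema()` should achieve the same result without hand-written SQL. I did not verify it in this exact failure mode, so treat it as an option to evaluate rather than a confirmed fix:

```js
// In mysqlsh, connected to mysql02. 'force' skips the confirmation
// prompt; 'clearReadOnly' lifts super_read_only for the operation.
dba.dropMetadataSchema({force: true, clearReadOnly: true});
```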
Then add mysql02 back to the MySQL InnoDB Cluster:
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306')
NOTE: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it is missing transactions that were purged from all cluster members.
Clone based recovery was selected because it seems to be safely usable.
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using 'mysql02:3306'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: mysql02:3306 is being cloned from mysql01:3306
** Stage DROP DATA: Completed
NOTE: mysql02:3306 is shutting down...
* Waiting for server restart... ready
* mysql02:3306 has restarted, waiting for clone to finish...
** Stage RESTART: Completed
* Clone process has finished: 730.94 MB transferred in 6 sec (121.82 MB/s)
Incremental state recovery is now in progress.
* Waiting for distributed recovery to finish...
NOTE: 'mysql02:3306' is being recovered from 'mysql01:3306'
* Distributed recovery has finished
The instance 'mysql02:3306' was successfully added to the cluster.
MySQL mysql01:3306 ssl JS > cluster.status()
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql02:3306": {
                "address": "mysql02:3306",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
Adding mysql03 back to the cluster was straightforward, so I will not go through it here. The broader lesson: when you take over a project or help troubleshoot someone else's system, an unfamiliar or non-standard environment makes it very easy to step into traps like these. If you inherit an unfamiliar production environment, be cautious with every operation.
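For completeness, mysql03 went through the same two steps. A sketch with this environment's names, passing `recoveryMethod: 'clone'` up front so the interactive prompt never appears:

```js
// Drop the stale metadata entry for the OFFLINE member, then re-add
// it with clone-based state recovery.
cluster.removeInstance('mysqladm@mysql03:3306', {force: true});
cluster.addInstance('mysqladm@mysql03:3306', {recoveryMethod: 'clone'});
```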