概念
主從複製是 MySQL 中一個非常常用的功能,它允許一個數據庫服務器作為主服務器,另一個數據庫服務器作為從服務器,主服務器的數據會被複制到從服務器,從而實現數據的備份和容災。
主從交接的數據:二進制日誌文件(Binary Log),主服務器的所有修改操作都會記錄在這個日誌文件中,通過將日誌文件傳送給從服務器,從服務器用來執行這個日誌文件,就可以實現數據的同步功能。
這點和Redis的主從相似,其中都會涉及到日誌文件的當前位置,記錄主從數據的差異點,進行數據同步,因為不可能每次都進行全量同步。
故障記錄
因查詢緩存和MTS的兼容性bug導致SQL線程停止
業務發現故障:在實時報表中數據不更新
通過在從庫中執行:show slave status ,發現其中SQLSlave_SQL_Running停止了,從而判斷是主從複製出現故障。
在從庫MySQL的錯誤日誌error.log中,發現日誌:
2025-12-25T07:17:03 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
...
stack_bottom = 7f3d19f9ee68 thread_stack 0x40000
/data01/local/mysql/bin/mysqld(my_print_stacktrace+0x35)[0xfa8fd5]
/data01/local/mysql/bin/mysqld(handle_fatal_signal+0x4b9)[0x825999]
/lib64/libpthread.so.0(+0xf630)[0x7f4e1a71e630]
/data01/local/mysql/bin/mysqld(pfs_start_rwlock_wait_v1+0x1b)[0x1238d3b]
/data01/local/mysql/bin/mysqld(_ZN11Query_cache27invalidate_query_block_listEP3THDP23Query_cache_block_table+0x53)[0xd1b763]
/data01/local/mysql/bin/mysqld(_ZN11Query_cache16invalidate_tableEP3THDPhm+0x40)[0xd1b890]
/data01/local/mysql/bin/mysqld(_ZN11Query_cache27invalidate_locked_for_writeEP10TABLE_LIST+0x99)[0xd1b9f9]
/data01/local/mysql/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0xa93)[0xf37263]
/data01/local/mysql/bin/mysqld(_Z27slave_worker_exec_job_groupP12Slave_workerP14Relay_log_info+0x175)[0xf88515]
其中:mysqld出現嚴重錯誤,信號11(內存訪問越界),在進程棧中發現有_ZN11Query_cache,是在查詢緩存模塊中,崩潰的,然後在下面的日誌中,發現有MTS執行間隙恢復事務,用於恢復崩潰期間未執行的事務。
2025-12-25T07:17:19.552496Z 0 [Note] MTS recovery: starting coordinator thread to fill MTS gaps.
2025-12-25T07:17:19.557734Z 1 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'master-bin.000305' at position 407803646, relay log './0002-relay-bin.000592' position: 407803861
2025-12-25T07:17:19.558326Z 1 [Note] Slave for channel '': MTS Recovery has completed at relay log ./0002-relay-bin.000592, position 407805299 master log master-bin.000305, position 407805084.
2025-12-25T07:17:19.559725Z 1 [Note] Slave SQL thread stopped according to UNTIL SQL_AFTER_MTS_GAPS as it has processed all gap transactions left from the previous slave session.
2025-12-25T07:17:19.559884Z 1 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'master-bin.000305' at position 407805084
2025-12-25T07:17:19.559941Z 0 [Warning] Recovery from master pos 407805084 and file master-bin.000305 for channel ''. Previous relay log pos and relay log file had been set to 407805299, ./0002-relay-bin.000592 respectively.
2025-12-25T07:17:19.560375Z 0 [Note] MTS recovery: completed successfully.
通過日誌發現,這個是MTS gaps正常的工作機制,觸發這個機制的原因是MySQL異常停止的原因,這個機制用來保證數據一致性(在從庫中事務的執行順序要正確)
MTS Gaps
- 在MTS並行複製中,多個工作線程並行應用事務
- 當從庫異常停止(如崩潰、重啓)時,某些工作線程可能已經執行了事務,但其他線程還在執行
- 這就產生了"間隙"(gaps)- 部分已提交、部分未提交的事務序列
恢復過程:
- 檢測Gaps:啓動時,MySQL會檢測是否有未完成的gap事務
- 單線程恢復:SQL線程以單線程模式運行,專門處理這些gap事務
- 恢復完成:所有gap事務處理完畢後,SQL線程自動停止,依據
UNTIL SQL_AFTER_MTS_GAPS
臨時解決:把sql線程啓動了就行,通過執行命令:
-- 先檢查IO線程狀態
SHOW SLAVE STATUS
-- 如果Slave_IO_Running為Yes,只需要啓動SQL線程
START SLAVE SQL_THREAD;
-- 或者
START SLAVE;
-- 檢查查詢緩存是否開啓
SHOW VARIABLES LIKE 'query_cache%';
-- 通過命令行立即關閉查詢緩存功能
SET GLOBAL query_cache_type = OFF;
-- 將查詢緩存大小設為0(立即生效)
SET GLOBAL query_cache_size = 0;
根本解決:問題的原因通過日誌發現,是在查詢緩存那裏出現的問題,通過查閲是查詢緩存和MTS有兼容性bug,通過禁用查詢緩存,my.cnf:
[mysqld]
# 確保查詢緩存關閉
query_cache_type = 0
query_cache_size = 0
通過臨時解決後,會看見日誌的:
2025-12-25T08:14:25.013567Z 2100 [Note] Slave I/O thread for channel '': connected to master 'slave@001.novalocal:xxxx',replication started in log 'master-bin.000305' at position 454685976
2025-12-25T08:14:25.020009Z 2101 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'master-bin.000305' at position 407805084, relay log './0002-relay-bin.000593' position: 4