PostgreSQL High Availability: Automatic Failover with repmgr

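I previously wrote up how to build a repmgr high-availability cluster (https://www.cnblogs.com/wy123/p/18531710). The setup is fairly simple, so I won't repeat it here. To keep things simple, this test uses one primary and two standbys. I had never found time to test repmgr's manual and automatic failover, so I set up an environment and ran a failover test.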

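Environment:

ubuntu05: 192.168.152.111 (PostgreSQL service: postgresql9000, repmgr service: repmgr9000)
ubuntu06: 192.168.152.112 (PostgreSQL service: postgresql9000, repmgr service: repmgr9000)
ubuntu07: 192.168.152.113 (PostgreSQL service: postgresql9000, repmgr service: repmgr9000)

1. ubuntu05, ubuntu06 and ubuntu07 form a repmgr cluster; ubuntu05 is the primary and the other two are standbys
2. Forcibly stop the PostgreSQL service on ubuntu05
3. repmgr completes the automatic failover and promotes ubuntu06 to primary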

repmgr configuration

The repmgr configuration file, repmgr.conf (this one is from ubuntu06):

node_id=2
node_name='ubuntu06'
conninfo='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
data_directory='/usr/local/pgsql16/pg9000/data'
pg_bindir='/usr/local/pgsql16/server/bin'
priority=80

# Automatic failover settings
failover=automatic
promote_command='/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql16/server/bin/repmgr standby follow -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/usr/local/pgsql16/repmgr/repmgr.log'

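# To have repmgrd record monitoring data, enable monitoring_history in repmgr.conf
monitoring_history=true
# Monitoring interval in seconds (the default is 2)
monitor_interval_secs=5
# Number of attempts to reconnect to the primary before failing over (the default is 6)
reconnect_attempts=12
# Wait 5 seconds between reconnection attempts
reconnect_interval=5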
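
As a rough check on these two settings, the time repmgrd spends retrying before it starts a failover can be estimated from reconnect_attempts and reconnect_interval. The sketch below just does that arithmetic. The function name is made up for illustration, and it assumes each ping fails immediately (connection refused), as in the failover log in this post; if the host does not answer at all, connect_timeout can make each attempt take much longer.

```python
def retry_window_seconds(reconnect_attempts: int, reconnect_interval: int) -> int:
    """Estimate the seconds spent retrying before failover starts.

    Assumption: every attempt fails at once, so the only waiting is the
    sleep between consecutive attempts (there is no sleep after the last one).
    """
    if reconnect_attempts < 1:
        raise ValueError("reconnect_attempts must be at least 1")
    return (reconnect_attempts - 1) * reconnect_interval


# Values from the repmgr.conf above
print(retry_window_seconds(12, 5))  # 55
```

With reconnect_attempts=12 and reconnect_interval=5 this gives 55 seconds, which is in line with the roughly one-minute window between the first failed ping and the promotion in the log.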

systemd unit file for repmgrd, so that repmgrd starts automatically

[Unit]
Description=PostgreSQL Replication Manager Daemon
After=network.target postgresql9000.service
Requires=postgresql9000.service

[Service]
Type=forking
User=postgres
Group=postgres
ExecStart=/usr/local/pgsql16/server/bin/repmgrd -f /usr/local/pgsql16/repmgr/repmgr.conf --pid-file /usr/local/pgsql16/repmgr/repmgrd.pid
ExecStop=/bin/kill -QUIT $MAINPID
PIDFile=/usr/local/pgsql16/repmgr/repmgrd.pid
Restart=always
RestartSec=5

# Environment variables (if needed)
Environment=PATH=/usr/local/pgsql16/server/bin:/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target

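Manual switchover

A prerequisite for a repmgr switchover is passwordless SSH trust between the nodes.

1. Manual switchover: run the following on whichever standby is to be promoted to primary:
    /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
    --siblings-follow  makes all other standbys switch their replication source to the new primary

    Internally, the switchover works as follows:
    1. Shut down the current primary, ubuntu06
    2. Once the old primary has fully stopped, run pg_promote() on ubuntu05
    3. Restart the old primary ubuntu06, demoted to a standby and pointing at ubuntu05 as its replication source
    4. The sibling nodes are likewise redirected to replicate from ubuntu05
    5. The switchover is complete
    
    Check the cluster status on the current node, ubuntu05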
	repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
	postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
	 ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
	 1  | ubuntu05 | standby |   running | ubuntu06 | default  | 80       | 2        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	 2  | ubuntu06 | primary | * running |          | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	 3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	postgres@ubuntu05:~$
	postgres@ubuntu05:~$
	postgres@ubuntu05:~$
    
    Run the switchover
    postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
	NOTICE: executing switchover on node "ubuntu05" (ID: 1)
	NOTICE: attempting to pause repmgrd on 3 nodes
	NOTICE: local node "ubuntu05" (ID: 1) will be promoted to primary; current primary "ubuntu06" (ID: 2) will be demoted to standby
	NOTICE: stopping current primary node "ubuntu06" (ID: 2)
	NOTICE: issuing CHECKPOINT on node "ubuntu06" (ID: 2)
	DETAIL: executing server command "/usr/local/pgsql16/server/bin/pg_ctl  -D '/usr/local/pgsql16/pg9000/data' -W -m fast stop"
	INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
	INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
	INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
	INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
	INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
	INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
	NOTICE: current primary has been cleanly shut down at location 0/18000028
	NOTICE: promoting standby to primary
	DETAIL: promoting server "ubuntu05" (ID: 1) using pg_promote()
	NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
	NOTICE: STANDBY PROMOTE successful
	DETAIL: server "ubuntu05" (ID: 1) was successfully promoted to primary
	NOTICE: node "ubuntu05" (ID: 1) promoted to primary, node "ubuntu06" (ID: 2) demoted to standby
	NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
	INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
	NOTICE: switchover was successful
	DETAIL: node "ubuntu05" is now primary and node "ubuntu06" is attached as standby
	NOTICE: STANDBY SWITCHOVER has completed successfully
	postgres@ubuntu05:~$
	postgres@ubuntu05:~$
    
	postgres@ubuntu05:~$
	postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
	 ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
	 1  | ubuntu05 | primary | * running |          | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	 2  | ubuntu06 | standby |   running | ubuntu05 | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	 3  | ubuntu07 | standby |   running | ubuntu05 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
	postgres@ubuntu05:~$
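
To confirm the result of a switchover from a script rather than by eye, the text of "cluster show" can be parsed. The following is a minimal sketch that assumes the pipe-separated table layout shown above; the function names are invented for this example and are not part of repmgr.

```python
def parse_cluster_show(text: str) -> list:
    """Turn the table printed by `repmgr cluster show` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # prompts, the "----+----" separator, warnings
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def running_primaries(rows: list) -> list:
    """Names of nodes registered as primary and reported as '* running'."""
    return [r["Name"] for r in rows
            if r["Role"] == "primary" and r["Status"] == "* running"]


SAMPLE = """\
 ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | ubuntu05 | primary | * running |          | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
 2  | ubuntu06 | standby |   running | ubuntu05 | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
 3  | ubuntu07 | standby |   running | ubuntu05 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
"""

nodes = parse_cluster_show(SAMPLE)
print(running_primaries(nodes))  # ['ubuntu05']
```

In practice the text would come from running the cluster show command on any node; here it is pasted in so the example is self-contained.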

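Manual failover

1. Kill or stop the PostgreSQL service on the primary to simulate a primary failure
    systemctl stop postgresql9000
		
2. Check the cluster status from a standby; the original primary is now unreachable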
	postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
		ID | Name     | Role    | Status        | Upstream   | Location | Priority | Timeline | Connection string
	----+----------+---------+---------------+------------+----------+----------+----------+------------------------------------------------------------------------------
		1  | ubuntu05 | primary | ? unreachable | ?          | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		2  | ubuntu06 | standby |   running     | ? ubuntu05 | default  | 80       | 3        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		3  | ubuntu07 | standby |   running     | ? ubuntu05 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

	WARNING: following issues were detected
		- unable to connect to node "ubuntu05" (ID: 1)
		- node "ubuntu05" (ID: 1) is registered as an active primary but is unreachable
		- unable to connect to node "ubuntu06" (ID: 2)'s upstream node "ubuntu05" (ID: 1)
		- unable to determine if node "ubuntu06" (ID: 2) is attached to its upstream node "ubuntu05" (ID: 1)
		- unable to connect to node "ubuntu07" (ID: 3)'s upstream node "ubuntu05" (ID: 1)
		- unable to determine if node "ubuntu07" (ID: 3) is attached to its upstream node "ubuntu05" (ID: 1)

	HINT: execute with --verbose option to see connection error messages
	postgres@ubuntu06:~$
		
3. Manually promote ubuntu06 to primary
	/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
	Check the cluster status: ubuntu06 is now the primary, and the old primary ubuntu05 is marked as failed
		
	postgres@ubuntu06:~$
	postgres@ubuntu06:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
	NOTICE: promoting standby to primary
	DETAIL: promoting server "ubuntu06" (ID: 2) using pg_promote()
	NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
	NOTICE: STANDBY PROMOTE successful
	DETAIL: server "ubuntu06" (ID: 2) was successfully promoted to primary
	NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
	INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
	postgres@ubuntu06:~$
	postgres@ubuntu06:~$ ### Check the cluster status; ubuntu06 is now the primary
	postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
		ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
		1  | ubuntu05 | primary | - failed  | ?        | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		2  | ubuntu06 | primary | * running |          | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		
	WARNING: following issues were detected
		- unable to connect to node "ubuntu05" (ID: 1)
		
	HINT: execute with --verbose option to see connection error messages
	postgres@ubuntu06:~$
		
	
4. Rejoin the old primary to the cluster
    4.1 Start the old primary
		root@ubuntu05:~# systemctl start postgresql9000
		root@ubuntu05:~#
		root@ubuntu05:~# su - postgres
		postgres@ubuntu05:~$
		postgres@ubuntu05:~$
		postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
		 ID | Name     | Role    | Status               | Upstream   | Location | Priority | Timeline | Connection string
		----+----------+---------+----------------------+------------+----------+----------+----------+------------------------------------------------------------------------------
		 1  | ubuntu05 | primary | * running            |            | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		 2  | ubuntu06 | standby | ! running as primary |            | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		 3  | ubuntu07 | standby |   running            | ! ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
		
		WARNING: following issues were detected
		  - node "ubuntu06" (ID: 2) is registered as standby but running as primary
		  - node "ubuntu07" (ID: 3) reports a different upstream (reported: "ubuntu06", expected "ubuntu05")
		
		postgres@ubuntu05:~$
		
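	4.2 Run pg_rewind through repmgr node rejoin (shown here with --dry-run; remove --dry-run to actually perform the rejoin)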
		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run
		 
		postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run
		NOTICE: rejoin target is node "ubuntu06" (ID: 2)
		INFO: replication connection to the rejoin target node was successful
		INFO: local and rejoin target system identifiers match
		DETAIL: system identifier is 7550951818891860956
		NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
		DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/1B000028
		INFO: prerequisites for using pg_rewind are met
		INFO: pg_rewind would now be executed
		DETAIL: pg_rewind command is:
		  /usr/local/pgsql16/server/bin/pg_rewind -D '/usr/local/pgsql16/pg9000/data' --source-server='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
		INFO: prerequisites for executing NODE REJOIN are met
		postgres@ubuntu05:~$
		postgres@ubuntu05:~$

		
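		Or, more bluntly, delete the local data directory and clone again
		
		Clone the database (remove --dry-run to perform the actual clone)
 		/usr/local/pgsql16/server/bin/repmgr -h 192.168.152.112 -p 9000 -U repmgr -d repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby clone --dry-run
		Then simply start the database service
		-- Unregister: this deletes the node's row from the nodes table
		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby unregister
		-- Register again: this loads the settings in repmgr.conf back into the nodes table
		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register
		
		
		-- Forced registration (--force): this overwrites the existing record
		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register --force
		-- Specify the upstream node; usually not needed, because the primary is found from postgresql.auto.conf
		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register  --upstream-node-id=2
		
		
		
		Healthy nodes can be re-registered too: after changing the configuration, re-registering reloads it. Run the re-registration on the standbys (pg02, pg03):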
		$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
		$ repmgr -f /home/postgres/repmgr/repmgr.conf standby register --upstream-node-id=1
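
The cluster show output in step 4.1 is worth checking for automatically: after the old primary restarts, repmgr marks mismatches with a leading "!" (and unreachable or failed nodes with "?" or "-", as in steps 2 and 3). The sketch below flags such rows. It assumes the table layout shown in this post, and the function name is hypothetical.

```python
def flagged_nodes(cluster_show_text: str) -> dict:
    """Map node name -> Status/Upstream values that carry a '!', '?' or '-' marker."""
    flagged = {}
    for line in cluster_show_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 5 or cells[0] == "ID":
            continue  # not a data row: header, separator, warnings, prompts
        name, status, upstream = cells[1], cells[3], cells[4]
        problems = [value for value in (status, upstream)
                    if value[:1] in ("!", "?", "-")]
        if problems:
            flagged[name] = problems
    return flagged


SAMPLE = """\
 ID | Name     | Role    | Status               | Upstream   | Location | Priority | Timeline | Connection string
----+----------+---------+----------------------+------------+----------+----------+----------+------------------------------------------------------------------------------
 1  | ubuntu05 | primary | * running            |            | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
 2  | ubuntu06 | standby | ! running as primary |            | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
 3  | ubuntu07 | standby |   running            | ! ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

WARNING: following issues were detected
  - node "ubuntu06" (ID: 2) is registered as standby but running as primary
"""

print(flagged_nodes(SAMPLE))
# {'ubuntu06': ['! running as primary'], 'ubuntu07': ['! ubuntu06']}
```

A non-empty result here is a hint that the old primary should be rejoined or re-cloned before it takes any traffic.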

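Automatic failover

Forcibly stop the PostgreSQL service on the primary, ubuntu05, to simulate a failure.

The automatic failover proceeds as follows:

The repmgr log of the failover shows that repmgrd keeps retrying according to the reconnect_interval and reconnect_attempts settings in the configuration file above. If the primary is still unreachable after the last attempt, the failover begins. The whole process takes about one minute.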

[2025-09-18 13:24:00] [INFO] monitoring connection to upstream node "ubuntu05" (ID: 1)
[2025-09-18 13:26:26] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
[2025-09-18 13:26:26] [DETAIL] last monitoring statistics update was 5 seconds ago
[2025-09-18 13:29:01] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
[2025-09-18 13:29:01] [DETAIL] last monitoring statistics update was 5 seconds ago
*************************************************** Simulated primary failure starts here; the standby begins retrying *************************************************************************
[2025-09-18 13:30:01] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
[2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:01] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)
[2025-09-18 13:30:01] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
[2025-09-18 13:30:01] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:01] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:02] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
[2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:02] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)
[2025-09-18 13:30:02] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
[2025-09-18 13:30:02] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:02] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:06] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
[2025-09-18 13:30:06] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:06] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:07] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
[2025-09-18 13:30:07] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:07] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:11] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
[2025-09-18 13:30:11] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:11] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:12] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
[2025-09-18 13:30:12] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:12] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:12] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:16] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
[2025-09-18 13:30:16] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:16] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:17] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
[2025-09-18 13:30:17] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:17] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
[2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
[2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
[2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
[2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
[2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
[2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
[2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
[2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
[2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
[2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
[2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
[2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
[2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
[2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
[2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
[2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered
[2025-09-18 13:30:57] [INFO] 3 total nodes registered
[2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")
[2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0
[2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
[2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
[2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
[2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
[2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)
[2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2025-09-18 13:30:57] [INFO] promote_command is:
  "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
[2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"

[2025-09-18 13:30:57] [WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
[2025-09-18 13:30:57] [DETAIL] these nodes will remain attached to the current primary:
  ubuntu07 (node ID: 3)
[2025-09-18 13:30:57] [NOTICE] promoting standby to primary
[2025-09-18 13:30:57] [DETAIL] promoting server "ubuntu06" (ID: 2) using pg_promote()
[2025-09-18 13:30:57] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
[2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
[2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered
[2025-09-18 13:30:57] [INFO] 3 total nodes registered
[2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")
[2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0
[2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
[2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
[2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
[2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
[2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)
[2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2025-09-18 13:30:57] [INFO] promote_command is:
  "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
[2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"

[2025-09-18 13:30:57] [ERROR] STANDBY PROMOTE can only be executed on a standby node
[2025-09-18 13:30:57] [ERROR] promote command failed
[2025-09-18 13:30:57] [DETAIL] promote command exited with error code 8
[2025-09-18 13:30:57] [INFO] checking if original primary node has reappeared
[2025-09-18 13:30:57] [ERROR] connection to database failed
[2025-09-18 13:30:57] [DETAIL] 
connection to server at "192.168.152.111", port 9000 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?

[2025-09-18 13:30:57] [DETAIL] attempted to connect using:
  user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr options=-csearch_path=
[2025-09-18 13:30:57] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2025-09-18 13:30:57] [NOTICE] local node is primary, checking local node state
[2025-09-18 13:30:57] [NOTICE] resuming monitoring as primary node after 0 seconds
[2025-09-18 13:30:57] [INFO] 1 followers to notify
[2025-09-18 13:30:57] [INFO] reconnecting to node "ubuntu07" (ID: 3)...
[2025-09-18 13:30:57] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2
INFO:  node 3 received notification to follow node 2
[2025-09-18 13:30:57] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)
[2025-09-18 13:30:58] [NOTICE] STANDBY PROMOTE successful
[2025-09-18 13:30:58] [DETAIL] server "ubuntu06" (ID: 2) was successfully promoted to primary
[2025-09-18 13:30:58] [INFO] checking state of node 2, 1 of 12 attempts
[2025-09-18 13:30:58] [NOTICE] node 2 has recovered, reconnecting
[2025-09-18 13:30:58] [INFO] connection to node 2 succeeded
[2025-09-18 13:30:58] [INFO] original connection is still available
[2025-09-18 13:30:58] [INFO] 1 followers to notify
[2025-09-18 13:30:58] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2
INFO:  node 3 received notification to follow node 2
[2025-09-18 13:30:58] [INFO] switching to primary monitoring mode
[2025-09-18 13:30:58] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)
[2025-09-18 13:30:58] [INFO] child node "ubuntu07" (ID: 3) is attached
[2025-09-18 13:31:02] [NOTICE] new standby "ubuntu07" (ID: 3) has connected
[2025-09-18 13:35:57] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:35:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:45:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:45:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:50:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
[2025-09-18 13:50:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
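
To put a number on "about one minute", the timestamps in repmgr.log can be compared directly. This is a minimal sketch that assumes the log line format shown above; the function name and the choice of marker messages ("unable to ping" and "STANDBY PROMOTE successful") are picked for this example.

```python
import re
from datetime import datetime

# Matches lines such as: [2025-09-18 13:30:01] [WARNING] unable to ping ...
LOG_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def failover_duration(log_text: str):
    """Seconds from the first failed ping to the successful promotion, or None."""
    first_failure = None
    promoted = None
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue  # continuation lines without a timestamp
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(3)
        if first_failure is None and message.startswith("unable to ping"):
            first_failure = when
        elif promoted is None and "STANDBY PROMOTE successful" in message:
            promoted = when
    if first_failure is None or promoted is None:
        return None
    return (promoted - first_failure).total_seconds()


SAMPLE_LOG = """\
[2025-09-18 13:30:01] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
[2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
[2025-09-18 13:30:58] [NOTICE] STANDBY PROMOTE successful
"""

print(failover_duration(SAMPLE_LOG))  # 57.0
```

The three sample lines are copied from the log above; run against the full log file, the same function reports the time between the first failed ping and the promotion.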

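Summary: pros and cons of repmgr

As a high-availability solution, repmgr is just about usable.
The upside is that installation and configuration are both fairly simple.
The downside is that it cannot handle consecutive automatic failovers: after the first failover, bringing the failed node back still requires a manual pg_rewind first.
repmgr stores its metadata in the local PostgreSQL database, so before the database starts the repmgr process does not know the cluster state and therefore cannot rewind automatically. This is the drawback of keeping cluster metadata inside PostgreSQL itself, and it is part of the gap between repmgr and Patroni.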
