一位QQ上的网友问我的一个问题,我觉得比较有意思。
我在启动POSTGRES的STANDBY数据库时,报
--------------------------
LOG: database system was shut down in recovery at 2011-12-30 23:20:25 CST
LOG: entering standby mode
LOG: restored log file "0000000100000002000000E9" from archive
LOG: invalid record length at 2/E9000108
LOG: invalid record length at 2/E9000108
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 2/E9000108
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "0000000100000002000000E9" from archive
LOG: invalid record length at 2/E9000108
FATAL: the database system is starting up
LOG: invalid record length at 2/E9000108
LOG: restored log file "0000000100000002000000E9" from archive
LOG: invalid record length at 2/E9000108
--------------------------------------
报错
invalid record length at 2/E9000108 * NOTE: We disallow len == 0 because it provides a useful bit of extra
* error checking in ReadRecord. This means that all callers of
* XLogInsert must supply at least some not-in-a-buffer data. However, we
* make an exception for XLOG SWITCH records because we don't want them to
* ever cross a segment boundary.
if (len == 0 && !isLogSwitch)
elog(PANIC, "invalid xlog record length %u", len);
因此报错的原因是standby在恢复过程中从 xlog
0000000100000002000000E9 的 2/E9000108位置读取到了len == 0 并且 !isLogSwitch 的信息。 日志中还提到了 restored log file "0000000100000002000000E9" from archive 因此问他要了recovery.conf 配置文件的内容如下 : standby_mode = 'on' primary_conninfo = 'host=隐藏 port=5432 user=隐藏 password=隐藏' restore_command = 'cp "/home/postgres/data/pg_xlog/%f" "%p"' trigger_file = '/home/postgres/trigger_activestb' 从这个配置文件来看,
restore_command = 'cp "/home/postgres/data/pg_xlog/%f" "%p"' 有问题,如果要配置 restore_command,那么应该是把主库的归档文件拷贝standby的pg_xlog目录。 要么就是不要配置 restore_command , 那么standby会尝试从stream获取xlog的信息(要求primary库的这个xlog没有被覆盖)。 经过确认主库的
0000000100000002000000E9 存在pg_xlog中未被覆盖. 注释standby节点recovery.conf的restore_command = 'cp "/home/postgres/data/pg_xlog/%f" "%p"' 未解决,还是报
invalid record length at 2/E9000108 做STANDBY的时候有没有按照正常流程来做( select pg_start_backup(), backup, select pg_stop_backup() ) 可以尝试使用pg_filedump来分析一下
0000000100000002000000E9的内容。(由于时间关系,暂时没做) src/backend/access/transam/xlog.c