Hot Standby ( ERROR: canceling statement due to conflict with recovery )

jas0n_liu 2013-12-03

Reposted from: http://blog.sina.com.cn/s/blog_773523db0101054f.html

--1 A long-running query fails with an error
skytf=> SELECT count(id) from skytf.tbl_info;
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

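    Note: the table "skytf.tbl_info" is large; the data alone takes up 12 GB. This count query normally needs about two minutes, but every run was aborted partway through with the error above. Judging from the message, my first guess was that the query running on the standby conflicted with changes arriving from the primary.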

--2 A Google search turned up the following
   Long running queries on the standby are a bit tricky, because they
might need to see row versions that are already removed on the master.

    Note: in other words, long-running queries on a standby node are tricky, because they may still need to see row versions that have already been removed on the primary.
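
To confirm that cancellations really come from this kind of conflict, the standby's conflict counters can be inspected. The query below is a minimal sketch, assuming PostgreSQL 9.1 or later, where the pg_stat_database_conflicts view is available; the original post does not say which version it ran on. The "row versions that must be removed" error is the case counted in confl_snapshot.

```sql
-- Run on the standby, connected to the database where the query was canceled.
-- Each column counts queries canceled for one kind of recovery conflict.
SELECT datname,
       confl_snapshot,    -- canceled because needed row versions were removed
       confl_lock,        -- canceled due to lock timeouts
       confl_bufferpin,   -- canceled due to pinned buffers
       confl_tablespace,  -- canceled due to dropped tablespaces
       confl_deadlock     -- canceled due to deadlocks
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```

If confl_snapshot grows each time the long query fails, the diagnosis above is likely correct.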

--3 Solution: change a parameter

Set the parameter in the standby's postgresql.conf to the following value: max_standby_streaming_delay = 300s

max_standby_streaming_delay (integer)
When Hot Standby is active, this parameter determines how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries, as described in Section 25.5.2. max_standby_streaming_delay applies when WAL data is being received via streaming replication. The default is 30 seconds. Units are milliseconds if not specified. A value of -1 allows the standby to wait forever for conflicting queries to complete. This parameter can only be set in the postgresql.conf file or on the server command line.

Note that max_standby_streaming_delay is not the same as the maximum length of time a query can run before cancellation; rather it is the maximum total time allowed to apply WAL data once it has been received from the primary server. Thus, if one query has resulted in significant delay, subsequent conflicting queries will have much less grace time until the standby server has caught up again.

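Note: the explanation above is easy to follow. When the standby is serving queries and one of them conflicts with WAL received from the primary that is about to be applied, this parameter determines how long the standby waits before canceling the query. The default is 30 s, so it is no surprise that the count query, which needs about two minutes, was canceled by the standby. The parameter can also be set to -1, which makes the standby wait indefinitely for the conflicting query. That is clearly risky: as long as the query has not finished, the standby stops applying changes from the primary and keeps falling behind until the query completes. The value of 300s used here was agreed on with the developers.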
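
Since the manual says the parameter is set in postgresql.conf, a restart should not be needed; a configuration reload is enough. The following is a minimal sketch of applying and checking the change, assuming a superuser session on the standby and that postgresql.conf has already been edited there.

```sql
-- Run on the standby as a superuser, after editing postgresql.conf.
SELECT pg_is_in_recovery();        -- returns true when connected to a standby
SELECT pg_reload_conf();           -- ask the server to re-read postgresql.conf
SHOW max_standby_streaming_delay;  -- confirm the new value is in effect
```

If SHOW still reports the old value, check that the edited file is the one the server actually loads (SHOW config_file).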

--4 Run the count query again
skytf=> select count(*) from tbl_info;
count
----------
88123735
(1 row)

Time: 131068.569 ms

Note: this time the query finally completed. It took a little over two minutes (about 131 seconds), well under the 5-minute limit.

--5 Other suggestions
        Another option is to increase vacuum_defer_cleanup_age on the primary server, so that dead rows will not be cleaned up as quickly as they normally would be. This will allow more time for queries to execute before they are cancelled on the standby, without having to set a high max_standby_streaming_delay. However it is difficult to guarantee any specific execution-time window with this approach, since vacuum_defer_cleanup_age is measured in transactions executed on the primary server.

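   Note: the paragraph above comes from the manual and is another suggested way to handle conflicts between the standby and the primary: set vacuum_defer_cleanup_age. Because this parameter is measured in transactions rather than in time, it is hard to tune in practice, so I did not take this approach.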
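
A rough calculation shows why a transaction-based setting is hard to turn into a time window. The sketch below uses hypothetical numbers (a setting of 30000 and primary workloads of 100 and 1000 transactions per second); none of them come from the original post. It applies to releases that still have vacuum_defer_cleanup_age, which was removed in PostgreSQL 16.

```sql
-- Run on the primary.
SHOW vacuum_defer_cleanup_age;  -- current setting, in transactions

-- Hypothetical: with vacuum_defer_cleanup_age = 30000, the time for which
-- cleanup is deferred depends entirely on how busy the primary is.
SELECT 30000 / 100.0  AS window_seconds_at_100_tps,
       30000 / 1000.0 AS window_seconds_at_1000_tps;
```

The same setting gives about 300 seconds at the lower rate and about 30 seconds at the higher one, so the protection shrinks exactly when the primary is busiest.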

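--6 Summary
       PostgreSQL's Hot Standby is a great feature, but take care when running queries on the standby: with unsuitable settings, the standby may cancel queries instead of serving them.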