Errors and performance notes from upgrading Kafka to v1.1

Posted by a user on 2023-05-26 02:10

1 answer

Answered by a community member on 2024-10-21 23:25

Our department recently upgraded Kafka from v0.8.2 to v1.1.1 and ran into a few errors along the way. These are my notes.

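We hit this problem during the gray release (staged rollout) of the producer.
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
I assumed it was a configuration problem somewhere and could not figure it out. Our use case is somewhat unusual: the producer caches data until it reaches a certain volume before calling send, so I suspected the messages were too large, but tuning a number of parameters made no difference. Later, while reading related issues on GitHub, I saw a report that this error appears when sending to a wrong topic. When I went back to check, I found that automatic topic creation was turned off on our cluster. We create topics by hand, and this one had been created incorrectly, so the producer could never fetch its metadata.
Nothing in the error points to the real cause. It only reports a metadata timeout, which makes this an easy trap to fall into.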
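The 60000 ms in the message matches the default of the producer setting max.block.ms, which bounds how long send() waits for metadata. As a minimal sketch, assuming a placeholder broker address and an arbitrary 5-second limit (neither comes from the original setup), a lower value during a staged rollout makes a wrong topic name surface sooner. Only java.util.Properties is used here; the resulting object is what would be passed to a KafkaProducer constructor.

```java
import java.util.Properties;

public class FailFastProducerConfig {
    // Builds producer settings. bootstrapServers and maxBlockMs are
    // illustrative inputs, not values from the original cluster.
    public static Properties build(String bootstrapServers, long maxBlockMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Upper bound on how long send() may block waiting for topic metadata.
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", 5000L);
        System.out.println("max.block.ms = " + props.getProperty("max.block.ms"));
    }
}
```

This does not fix a missing topic. It only shortens the wait before the timeout is raised, so the topic name still has to be checked against what was created on the cluster.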

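After the producer rollout was done, we started the gray release of the consumer and noticed occasional spikes in the corresponding data. Checking the logs on the consumer side, I found these error logs: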
[2020-04-07 22:56:35] [ERROR][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:... Offset commit failed on partition [topic-partition] at offset 277387: The request timed out.]
[2020-04-07 22:43:58] [WARN] [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:... failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.]
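Following the log's advice, I increased max.poll.interval.ms and lowered max.poll.records. Things improved, but the problem did not go away completely. The official documentation shows that the default for max.poll.interval.ms is already large at 5 minutes, so configuration could not be the cause. After what had happened during the producer rollout, I suspected something odd was going on again and checked the broker logs, which showed only entries for consumers leaving the group. In the end I had to ask an ops colleague on the team for help, and only after he looked into it did he find that one machine had a faulty disk, which caused offset commits to fail occasionally.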
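The reasoning that the default is "already large" can be checked with simple arithmetic: the consumer is considered stalled only if processing one poll() batch takes longer than max.poll.interval.ms. The sketch below computes the average time available per record. The 300000 ms figure is the 5-minute default mentioned above. The 500-record figure is the documented default for max.poll.records and is an addition here, not a value from the original setup.

```java
public class PollBudget {
    // Average processing time per record (in ms) before the gap between
    // two poll() calls would exceed max.poll.interval.ms.
    public static double perRecordBudgetMs(long maxPollIntervalMs, int maxPollRecords) {
        if (maxPollRecords <= 0) {
            throw new IllegalArgumentException("maxPollRecords must be positive");
        }
        return (double) maxPollIntervalMs / maxPollRecords;
    }

    public static void main(String[] args) {
        // Defaults: 5 minutes per poll loop, up to 500 records per poll().
        double budget = perRecordBudgetMs(300_000L, 500);
        System.out.println("Per-record budget with defaults: " + budget + " ms");
    }
}
```

With the defaults this works out to 600 ms per record, which is far more than typical message handling needs. If a rebalance still happens under such a budget, the cause is more likely outside the client settings, as it turned out to be here.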

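We upgraded to v1.1 in order to use LZ4 compression. Comparing before and after, inbound traffic on the brokers dropped by more than 50%, so in theory half the machines could handle the previous data volume. On the client side, the producer showed no noticeable drop in throughput. The consumer has to decompress, so poll latency went up considerably, but our consumers still handled the previous data volume without adding any instances. Overall, the improvement is substantial.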
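For reference, LZ4 is enabled on the producer through the compression.type setting. The following is a minimal sketch rather than the configuration we actually run: the broker address, the serializers, and the linger.ms and batch.size values are all assumptions chosen for illustration. Compression is applied per batch, so larger batches generally compress better.

```java
import java.util.Properties;

public class Lz4ProducerConfig {
    // Builds producer settings with LZ4 compression enabled.
    // bootstrapServers is a placeholder supplied by the caller.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Compress each record batch with LZ4 before it is sent to the broker.
        props.setProperty("compression.type", "lz4");
        // Assumed values: wait up to 20 ms and allow 64 KiB batches so that
        // each compressed batch holds more records.
        props.setProperty("linger.ms", "20");
        props.setProperty("batch.size", "65536");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        System.out.println("compression.type = " + props.getProperty("compression.type"));
    }
}
```

Consumers need no matching setting, since they decompress batches according to the codec recorded in each batch. That decompression is where the extra poll time on the consumer side comes from.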