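Background

In an IoT project, devices need to be controlled remotely, and a websocket connection is kept alive for that purpose. The problem: the SIM card has a limited data allowance, so the traffic consumed over the SIM card has to be reduced.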

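After optimization, traffic came down from the original level of about 1k per heartbeat; the measurements in the next section were taken before optimization.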

Traffic consumption

tcp: keep-alive
websocket: ping pong
go application: heart

A wireshark capture shows clearly where the traffic goes. It breaks down into three parts.

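The server sends a tcp keep-alive packet to the client every 15s. Total cost: 44 + 56 = 100 bytes.
The server sends a websocket ping packet to the client every 54s. Total cost: 58 + 56 + 62 + 56 = 241 bytes as recorded (the four sizes listed add up to 232).
The client sends a websocket heart packet to the server every 60s. Total cost: 79 + 56 = 105 bytes as recorded (the two sizes listed add up to 135).

Traffic per 60s, using the recorded totals: (100/15 + 241/54 + 105/60) * 60 ≈ 772.8 bytes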

Traffic per month: (100/15 + 241/54 + 105/60) * 60 * 60 * 24 * 30 / 1024 / 1024 ≈ 31.8 MB
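The arithmetic above can be reproduced with a short Go sketch. The byte totals and intervals are the ones recorded in this post; the type and function names (flow, bytesPerSecond, recordedFlows) are made up for illustration.

```go
package main

import "fmt"

// flow describes one periodic exchange seen in the capture.
type flow struct {
	name        string
	bytes       float64 // bytes per exchange, as recorded in the post
	intervalSec float64 // seconds between two exchanges
}

// bytesPerSecond returns the combined average rate of all flows.
func bytesPerSecond(flows []flow) float64 {
	var rate float64
	for _, f := range flows {
		rate += f.bytes / f.intervalSec
	}
	return rate
}

// recordedFlows returns the three totals and intervals from the capture.
func recordedFlows() []flow {
	return []flow{
		{name: "tcp keep-alive", bytes: 100, intervalSec: 15},
		{name: "websocket ping pong", bytes: 241, intervalSec: 54},
		{name: "application heart", bytes: 105, intervalSec: 60},
	}
}

func main() {
	rate := bytesPerSecond(recordedFlows())
	fmt.Printf("per 60s:   %.1f bytes\n", rate*60)
	fmt.Printf("per month: %.1f MB\n", rate*60*60*24*30/1024/1024)
}
```

Changing an interval or removing a flow in recordedFlows gives a quick estimate of what each option in the next section would save.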

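Solution

Based on the traffic breakdown, the following options came to mind:

Increase the tcp keep-alive interval
Disable websocket ping pong and rely on the client heart to keep the connection alive
Lengthen the interval between client heart packets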

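heart traffic path

client => load balancer => k8s ingress => application

Problem with this approach: every load balancer level has to be configured with a longer keepalive time so that the long connection is kept open.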
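Since every hop on this path can drop an idle connection, the client heart interval has to stay below the shortest idle timeout along the way. The following Go sketch illustrates that check. The hop values are the 900s aliyun slb limit and the 600s proxy-read-timeout that appear elsewhere in this post; the 30s safety margin and all names (hop, maxHeartInterval, examplePath, exampleMargin) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// hop is one element between client and application, together with
// the idle timeout after which it drops a silent connection.
type hop struct {
	name        string
	idleTimeout time.Duration
}

// exampleMargin is an assumed safety margin, not a value from the post.
const exampleMargin = 30 * time.Second

// maxHeartInterval returns the shortest idle timeout on the path minus
// the margin. ok is false for an empty path or when the margin leaves
// no usable interval.
func maxHeartInterval(path []hop, margin time.Duration) (d time.Duration, ok bool) {
	if len(path) == 0 {
		return 0, false
	}
	shortest := path[0].idleTimeout
	for _, h := range path[1:] {
		if h.idleTimeout < shortest {
			shortest = h.idleTimeout
		}
	}
	if shortest <= margin {
		return 0, false
	}
	return shortest - margin, true
}

// examplePath uses timeout values that appear in this post.
func examplePath() []hop {
	return []hop{
		{name: "load balancer (aliyun slb)", idleTimeout: 900 * time.Second},
		{name: "k8s ingress (proxy-read-timeout)", idleTimeout: 600 * time.Second},
	}
}

func main() {
	if d, ok := maxHeartInterval(examplePath(), exampleMargin); ok {
		fmt.Println("heart interval should not exceed", d)
	}
}
```

With these values the limit comes out at 9m30s. Adding a hop with a 350s timeout, as with the aws NLB discussed at the end of this post, would lower it to 5m20s.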

Related configuration

linux configuration

tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)
      The number of seconds between TCP keep-alive probes.

tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
      The maximum number of TCP keep-alive probes to send before
      giving up and killing the connection if no response is
      obtained from the other end.

tcp_keepalive_time (integer; default: 7200; since Linux 2.2)
      The number of seconds a connection needs to be idle before
      TCP begins sending out keep-alive probes.  Keep-alives are
      sent only when the SO_KEEPALIVE socket option is enabled.
      The default value is 7200 seconds (2 hours).  An idle
      connection is terminated after approximately an additional
      11 minutes (9 probes an interval of 75 seconds apart) when
      keep-alive is enabled.

      Note that underlying connection tracking mechanisms and
      application timeouts may be much shorter.

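On linux there is generally nothing to configure and the defaults can be used, since an application-layer heartbeat interval is unlikely to exceed 2 hours.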
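The kernel parameters above are system-wide defaults. A Go server can also choose its own keep-alive period per listener through the KeepAlive field of net.ListenConfig, which is one way to apply the "increase the tcp keep-alive interval" option without touching sysctl. This is a minimal sketch; the 5 minute period and the helper name listenWithKeepAlive are assumptions, not values taken from the project.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// keepAlivePeriod is an illustrative value, not one taken from the post.
const keepAlivePeriod = 5 * time.Minute

// listenWithKeepAlive opens a TCP listener whose accepted connections
// have tcp keep-alive enabled with the given period.
func listenWithKeepAlive(addr string, period time.Duration) (net.Listener, error) {
	lc := net.ListenConfig{KeepAlive: period}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Port 0 lets the OS pick a free port for this demonstration.
	ln, err := listenWithKeepAlive("127.0.0.1:0", keepAlivePeriod)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr(), "with keep-alive period", keepAlivePeriod)
}
```

A negative KeepAlive value disables keep-alive on accepted connections altogether, but then nothing at the tcp level keeps intermediate devices from dropping an idle connection.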

http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout

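nginx configuration:

keepalive_timeout: how long a connection may stay idle with no data before nginx actively closes it.
keepalive_requests: the total number of requests that can be sent over one connection.
proxy_http_version 1.1: the HTTP version used towards the backend server. The nginx default is 1.0; setting it to 1.1 is recommended, especially when keepalive is configured in an upstream block (see the official nginx documentation on keepalive).
proxy_connect_timeout: the timeout for establishing a connection to the backend server. The documentation says it should not exceed 75 seconds. Why 75? Possibly because the default keepalive_timeout is 75? To be confirmed.
proxy_read_timeout: if the backend server sends no data within this time, the connection is closed. It is the interval between two successive reads, not the duration of the whole connection.
proxy_send_timeout: if nginx sends no data to the backend server within this time, the connection is closed. It is the interval between two successive writes, not the duration of the whole connection.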

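Note: tcp keep-alive probes do not reset nginx's keepalive_timeout timer; one works at layer 4, the other at layer 7. Then why send keep-alive at all? To prevent lower-level devices from closing the connection.

The k8s ingress configuration is basically the same as the nginx configuration.

Setting up long-lived connections between nginx and the backend service.
Configuration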


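# Connection pool to the backend service.
upstream cab {
    least_conn;
    server 192.168.10.119:8080 max_fails=3 fail_timeout=15s;
    keepalive_timeout 8s;      # close an idle upstream connection after 8s
    keepalive_requests 1000;   # max requests served over one upstream connection
    keepalive 10;              # idle upstream connections kept per worker process
}
location / {
    proxy_pass http://cab;
    proxy_http_version 1.1;          # upstream keepalive needs HTTP/1.1
    proxy_set_header Connection "";  # clear the Connection header so the upstream connection stays open
}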

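upstream-keepalive-timeout sets the keepalive timeout for long-lived connections to the backend service.
After it is set in nginx-configuration, the ingress-controller does not need to be restarted.

Add the nginx ingress annotations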

keep-alive

annotations:
  nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  nginx.ingress.kubernetes.io/keep-alive: "899"
  nginx.ingress.kubernetes.io/keep-alive-requests: "10000"

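In the websocket application, disable the server-initiated ping, and call SetWriteDeadline and SetReadDeadline whenever data is read,
so that the server extends its deadlines each time a client heart arrives.

// Remove this ticker: it drives the server-initiated ping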
ticker := time.NewTicker(pingPeriod)
defer func() {
    ticker.Stop()
    c.conn.Close()
}()
for {
    select {
    case <-ticker.C:
        c.conn.SetWriteDeadline(time.Now().Add(writeWait))
        if err := c.conn.WriteMessage(websocket.PingMessage, nil); err != nil {
            log.Error(err)
            return
        }
    }
}
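With the ping ticker gone, the read side is what keeps the connection alive. The sketch below shows one possible shape of such a read loop: the read deadline is pushed forward before every read, so each client heart extends it. It covers only the read deadline. The code above appears to use gorilla/websocket, so the messageConn interface mirrors the SetReadDeadline and ReadMessage methods of that library's Conn type; fakeConn, readLoop, runDemo and the 10 minute timeout are assumptions made so that the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// messageConn lists the two methods this sketch needs. The signatures
// follow gorilla/websocket's Conn, so such a connection should fit.
type messageConn interface {
	SetReadDeadline(t time.Time) error
	ReadMessage() (messageType int, p []byte, err error)
}

// readLoop reads until an error occurs. The deadline is extended before
// each read, so every received message, including a client heart, keeps
// the connection alive for another idleTimeout.
func readLoop(conn messageConn, idleTimeout time.Duration, handle func([]byte)) error {
	for {
		if err := conn.SetReadDeadline(time.Now().Add(idleTimeout)); err != nil {
			return err
		}
		_, msg, err := conn.ReadMessage()
		if err != nil {
			return err
		}
		handle(msg)
	}
}

// fakeConn replays canned messages so the sketch runs without a network.
type fakeConn struct {
	msgs      [][]byte
	deadlines int // number of SetReadDeadline calls
}

func (f *fakeConn) SetReadDeadline(time.Time) error {
	f.deadlines++
	return nil
}

func (f *fakeConn) ReadMessage() (int, []byte, error) {
	if len(f.msgs) == 0 {
		return 0, nil, errors.New("connection closed")
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return 1, m, nil // 1 is the websocket text message type
}

// runDemo feeds two heart messages through readLoop and reports how many
// were handled and how often the deadline was set.
func runDemo() (handled, deadlines int) {
	conn := &fakeConn{msgs: [][]byte{[]byte("heart"), []byte("heart")}}
	_ = readLoop(conn, 10*time.Minute, func([]byte) { handled++ })
	return handled, conn.deadlines
}

func main() {
	handled, deadlines := runDemo()
	fmt.Println("handled:", handled, "deadline updates:", deadlines)
}
```

The idle timeout passed to readLoop should be longer than the client heart interval, otherwise a healthy connection is closed between two hearts.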

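Other issues

nginx sending a keepalive packet every 15 seconds may be caused by docker, so the container configuration needs checking. Verification method: start nginx directly on linux. Verified: with nginx started directly in a virtual machine rather than through docker, and with the same configuration, no keepalive packets are sent.

The aliyun slb maximum connection timeout is 900s, which does not limit the long-connection time.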

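Nginx 502 Bad Gateway: problem analysis and pitfalls encountered

Long connections on aws cannot exceed 350s

During testing, every attempt to lengthen the websocket long-connection time failed to get past 350s. After 350 seconds the client receives an RST packet from the server side and the connection drops.
According to the aws documentation, this is related to the NLB connection-idle-timeout setting.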

Elastic Load Balancing sets the idle timeout value for TCP flows to 350 seconds. You cannot modify this value. Clients or targets can use TCP keepalive packets to reset the idle timeout.

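The NLB value is fixed at 350s and cannot be adjusted.

aws load balancer Reset monitoring
Solution: replace the NLB with a layer-4 ELB (Classic Load Balancer) and increase the connection idle timeout.
Related materials: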

ltinyho's Blog

websocket traffic optimization
