Overview

In general, there are two ways a container can communicate with the external network in this k8s + Calico scenario:

  • the first option is by doing NAT on the k8s node
    • This is the “default” behavior. Any outgoing traffic from a container is SNATed by the node/host.
      • “Default” here is not limited to Calico; it is also the default behavior for many other networking plugins, including flannel.
    • It is the simplest method, with one main limitation:
      • port forwarding needs to be configured for each service hosted inside a k8s container
      • In my opinion, this is the main problem that Project Calico wants to solve.
  • the other option is by doing end-to-end IP routing
    • Project Calico simplifies container connectivity by providing end-to-end L3 routing
    • This means container connectivity is no different from bare-metal connectivity.
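For concreteness, the NAT option usually boils down to a masquerade rule on each node. The fragment below (iptables-save format) is an illustrative sketch only: the pool CIDR is hypothetical, and real plugins install more specific chains and rules than this.

```
*nat
# Illustrative: SNAT traffic leaving the container pool, except traffic
# delivered to other local workload interfaces (cali+).
-A POSTROUTING -s 10.201.0.0/16 ! -o cali+ -j MASQUERADE
COMMIT
```

With end-to-end routing, no such rule is needed: the container IP stays visible all the way to the external peer.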

In the previous post, we traced the traffic flow from one container to another within the same k8s cluster. Now we are going to connect the k8s cluster to the external network.

To summarize the previous findings:

  • by default, the k8s nodes have a full-mesh BGP peering relationship with each other.
  • Calico IPAM by default splits the IP pool range into multiple /26 subnets.
  • Once a node needs to run a container from a specific IP pool, a /26 block is assigned to that node.
  • Each node advertises its local container subnet to the other nodes via BGP
    • if the Calico CNI is used with a non-Calico IPAM, each node may advertise each container IP as a /32 route.
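To put numbers on the /26 split: each block holds 2^(32-26) = 64 addresses. Assuming, purely for illustration, a pool of 10.201.0.0/24 (the real pool CIDR would come from `calicoctl get ippool`), it splits into four /26 blocks, which lines up with the 10.201.0.128/26 and 10.201.0.192/26 blocks we will see assigned later:

```shell
# Enumerate the /26 blocks of a hypothetical 10.201.0.0/24 pool.
# A /26 spans 64 addresses, so block starts step in increments of 64.
for offset in $(seq 0 64 255); do
  echo "10.201.0.${offset}/26"
done
# prints 10.201.0.0/26, 10.201.0.64/26, 10.201.0.128/26, 10.201.0.192/26
```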

So, our next step is to connect the k8s cluster to the external network via BGP.

The definition of “external network” here is whatever the k8s node’s uplink is connected to.

  • If the k8s node is a bare-metal node, the IP gateway for the node is most likely a layer 3 ToR switch or an access router.
  • If the k8s node is a VM, then the gateway would be the VM hypervisor
    • this is what I have in my test setup: each k8s node here is a VM inside OpenStack with Contrail as the networking plugin
    • In our case, each k8s node needs to advertise its container prefix(es) to Contrail via BGP.
    • Fortunately, Contrail has a BGP-as-a-Service (BGPaaS) feature.

Target topology

The following diagram is similar to the one in step 2.

                     Internet
                         +                         
                         |                        +-------------+ 
                   +---------------+              | vmx gateway | 
                   | Internet GW   |              |             | 
                   +---------------+              +-------------+ 
             192.168.1.1 |                                 | 192.168.1.22 
                         |                                 |
   +---+-----------------+--------------+------------------+-------------------------------------------------------------------+--------+
       |                                |                                                                                      |
192.168.1.19                       192.168.1.18                                                                         192.168.1.142
       |                                |                                                                                      |
       |                                |                                                                                      |
  +----+----+   +------------------------------------------------------------------------------------------------+        +----+-------+
  |Contrail |   |                       |                                                                        |        |  Test PC   |
  |Control  |   |                     vrouter                                                                    |        |            |
  +---------+   |                       |                                                                        |        +------------+
                |                       |               openstack net1 100.64.1.0/24                             | 
                |    +-------+----------+------------------------------------------------------------------+     |
                |            |                                              |                                    |
                |            |                                              |                                    |
                |       100.64.1.23  ubuntu-4 k8s node                 100.64.1.24  ubuntu-3 k8s node            |
                |      +-----+------------------------------+         +-----+------------------------------+     |
                |      |     |                              |         |     |                              |     |
                |      |     |     10.201.0.192/26          |         |     |     10.201.0.128/26          |     |
                |      |   +-+----------------+--------+    |         |   +-+----------------+--------+    |     |
                |      |     |                |             |         |     |                |             |     |
                |      |     |           .196 |             |         |     |           .130 |             |     |
                |      |     |     +-----------------+      |         |     |     +-----------------+      |     |
                |      |     |     |                 |      |         |     |     |                 |      |     |
                |      |     |     |  container 11   |      |         |     |     |  container 21   |      |     |
                |      |     |     +-----------------+      |         |     |     +-----------------+      |     |
                |      |     |                              |         |     |                              |     |
                |      |     |                              |         |     |                              |     |
                |      |     |     10.91.2.0/26             |         |     |       10.91.1.128/26         |     |
                |      |   +-+----------------+--------+    |         |   +-+----------------+--------+    |     |
                |      |     |                |             |         |     |                |             |     |
                |      |     |             .0 |             |         |     |           .128 |             |     |
                |      |     |     +-----------------+      |         |     |     +-----------------+      |     |
                |      |     |     |                 |      |         |     |     |                 |      |     |
                |      |     |     |  container 12   |      |         |     |     |  container 22   |      |     |
                |      |     |     +-----------------+      |         |     |     +-----------------+      |     |
                |      |                                    |         |                                    |     |
                |      |                                    |         |                                    |     |
                |      +------------------------------------+         +------------------------------------+     |
                |                                                                                                |
                | Compute node                                                                                   |
                +------------------------------------------------------------------------------------------------+

Components

  • k8s node 1:
    • IP: 100.64.1.23
    • hostname: ubuntu-4
    • role: k8s master and worker node
  • k8s node 2:
    • IP: 100.64.1.24
    • hostname: ubuntu-3
    • role: worker node
  • Notes:
    Although my test setup has the k8s nodes running as VMs on top of OpenStack, this is not mandatory. You can run k8s on bare metal directly connected to physical L2/L3 switches.

Establish BGP peering between each k8s node and the Contrail control node

Set up the connection

  • on the k8s side, use calicoctl to configure the new BGP peering

    • exec into the calicoctl pod

        ubuntu@ubuntu-4:~$ kubectl exec -ti -n kube-system calicoctl -- /bin/busybox sh
      
    • create a yaml file with the BGP peer definition

        ~ # cat >> bgp.yaml << EOF
        > apiVersion: v1
        > kind: bgpPeer
        > metadata:
        >   peerIP: 100.64.1.1
        >   scope: global
        > spec:
        >   asNumber: 64512
        > EOF
      
        * Notes: 
            * Contrail's default GW for the subnet where the k8s nodes reside is 100.64.1.1
            * Contrail ASN is 64512
      
    • create the new BGP peer

        ~ # calicoctl create -f bgp.yaml
        Successfully created 1 'bgpPeer' resource(s)
      
      
    • Verify the new config

        ~ # calicoctl get -o yaml bgppeer
        - apiVersion: v1
          kind: bgpPeer
          metadata:
            peerIP: 100.64.1.1
            scope: global
          spec:
            asNumber: 64512
        ~ #
      
  • On Contrail control node, use the Contrail GUI to configure BGP peering
    • Go to Contrail UI -> Configure -> Services -> BGP as a Service
    • Add a new peer and fill in all the necessary input.
    • Verify the new BGP peering on Contrail
      • go to Monitor -> Infrastructure -> Control nodes -> select the control node
    • More detailed steps to configure and verify BGP on Contrail side can be found in the previous post about Contrail BGPaaS with Docker and Calico
  • Verify the BGP peering status from each node

    • For this, we need to get the calicoctl binary on each node

        wget https://github.com/projectcalico/calico/releases/download/v2.6.2/release-v2.6.2.tgz
        tar -xzvf release-v2.6.2.tgz
        cp release-v2.6.2/bin/calicoctl .
        rm -rf release-v2.6.2
      
    • run calicoctl on ubuntu-4 node

        ubuntu@ubuntu-4:~$ sudo ./calicoctl node status
        Calico process is running.
      
        IPv4 BGP status
        +--------------+-------------------+-------+------------+--------------------------------+
        | PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |              INFO              |
        +--------------+-------------------+-------+------------+--------------------------------+
        | 100.64.1.1   | global            | up    | 2017-10-21 | Established                    |
        | 100.64.1.24  | node-to-node mesh | up    | 2017-10-21 | Established                    |
        +--------------+-------------------+-------+------------+--------------------------------+
      
        IPv6 BGP status
        No IPv6 peers found.
      
    • ubuntu-3 node

        ubuntu@ubuntu-3:~$  sudo ./calicoctl node status
        Calico process is running.
      
        IPv4 BGP status
        +--------------+-------------------+-------+------------+--------------------------------+
        | PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |              INFO              |
        +--------------+-------------------+-------+------------+--------------------------------+
        | 100.64.1.1   | global            | up    | 2017-10-21 | Established                    |
        | 100.64.1.23  | node-to-node mesh | up    | 2017-10-21 | Established                    |
        +--------------+-------------------+-------+------------+--------------------------------+
      
        IPv6 BGP status
        No IPv6 peers found.
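    Eyeballing the tables works fine for two peers, but with more nodes it helps to script the check. A rough sketch (the sample output is inlined so the logic is self-contained; on a real node you would capture `sudo ./calicoctl node status` instead):

    ```shell
    # Fail loudly if any BGP peer row is not in the Established state.
    # $status stands in for the output of: sudo ./calicoctl node status
    status='+--------------+-------------------+-------+------------+-------------+
    | PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
    +--------------+-------------------+-------+------------+-------------+
    | 100.64.1.1   | global            | up    | 2017-10-21 | Established |
    | 100.64.1.24  | node-to-node mesh | up    | 2017-10-21 | Established |
    +--------------+-------------------+-------+------------+-------------+'

    # Peer rows are the "|" lines minus the header row.
    peers=$(printf '%s\n' "$status" | grep '|' | grep -v 'PEER ADDRESS')
    if printf '%s\n' "$peers" | grep -q -v 'Established'; then
      echo "WARNING: some BGP peers are not Established"
    else
      echo "all BGP peers Established"
    fi
    ```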
      

Verify the packet flow

  • Verify that the container subnets have been received by the MX gateway

      rw@gw-01> show route table contrail-public.inet.0 
    
      contrail-public.inet.0: 14 destinations, 14 routes (14 active, 0 holddown, 0 hidden)
      + = Active Route, - = Last Active, * = Both
    
      0.0.0.0/0          *[Static/5] 2d 07:03:17
                          > to 100.64.0.1 via lt-0/0/10.11
      ...deleted..
      10.91.1.128/26     *[BGP/170] 00:08:56, localpref 100, from 192.168.1.19
                            AS path: 64500 I, validation-state: unverified
                          > via gr-0/0/10.32769, Push 51
      10.91.2.0/26       *[BGP/170] 00:08:56, localpref 100, from 192.168.1.19
                            AS path: 64500 I, validation-state: unverified
                          > via gr-0/0/10.32769, Push 51
      10.201.0.128/26    *[BGP/170] 00:08:56, localpref 100, from 192.168.1.19
                            AS path: 64500 I, validation-state: unverified
                          > via gr-0/0/10.32769, Push 51
      10.201.0.192/26    *[BGP/170] 00:08:56, localpref 100, from 192.168.1.19
                            AS path: 64500 I, validation-state: unverified
                          > via gr-0/0/10.32769, Push 51
      ...deleted...
      100.64.1.23/32     *[BGP/170] 2d 06:57:58, MED 100, localpref 200, from 192.168.1.19
                            AS path: ?, validation-state: unverified
                          > via gr-0/0/10.32769, Push 51
      100.64.1.24/32     *[BGP/170] 2d 06:57:58, MED 100, localpref 200, from 192.168.1.19
                            AS path: ?, validation-state: unverified
                          > via gr-0/0/10.32769, Push 49
      ...deleted...
    
    • From the output above we can see that the Calico IP pools for container IPs, 10.201.x.x and 10.91.x.x, have been received by the external gateway
    • For more detailed routing verification, please refer to Contrail BGPaaS with Docker and Calico
  • Now it’s time to verify the connection to each container. Let’s check the IP of each container and which node hosts it.

      ubuntu@ubuntu-4:~$ kubectl get pods -o wide
      NAME                      READY     STATUS    RESTARTS   AGE       IP             NODE
      ssh-server                1/1       Running   0          2d        10.91.1.128    ubuntu-3
      ssh-server3               0/1       Pending   0          1d        <none>         <none>
      ssh-server4               1/1       Running   0          1d        10.91.2.0      ubuntu-4
      sshd-1-84c4bf4558-284dj   1/1       Running   0          2d        10.201.0.197   ubuntu-4
      sshd-2-78f7789cc8-95srr   1/1       Running   0          2d        10.201.0.130   ubuntu-3
      sshd-3-6bb86d6bf8-bq4q8   1/1       Running   0          2d        10.201.0.131   ubuntu-3
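    The ping targets can also be pulled out of that table mechanically. A small sketch that filters Running pods down to IP/node pairs (the table is inlined here as sample data; on the cluster you would pipe `kubectl get pods -o wide` into the awk filter instead):

    ```shell
    # Print "IP node" for every Running pod from 'kubectl get pods -o wide' output.
    pods='NAME         READY     STATUS    RESTARTS   AGE       IP             NODE
    ssh-server   1/1       Running   0          2d        10.91.1.128    ubuntu-3
    ssh-server3  0/1       Pending   0          1d        <none>         <none>
    ssh-server4  1/1       Running   0          1d        10.91.2.0      ubuntu-4'

    printf '%s\n' "$pods" | awk '$3 == "Running" {print $6, $7}'
    ```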
    
  • Verify the external connection to the containers hosted on node 1: ubuntu-4

      rw@gw-01> ping count 3 10.91.2.0
      PING 10.91.2.0 (10.91.2.0): 56 data bytes
      64 bytes from 10.91.2.0: icmp_seq=0 ttl=61 time=1.819 ms
      64 bytes from 10.91.2.0: icmp_seq=1 ttl=61 time=1.873 ms
      64 bytes from 10.91.2.0: icmp_seq=2 ttl=61 time=1.694 ms
    
      --- 10.91.2.0 ping statistics ---
      3 packets transmitted, 3 packets received, 0% packet loss
      round-trip min/avg/max/stddev = 1.694/1.795/1.873/0.075 ms
    
      rw@gw-01> traceroute no-resolve 10.91.2.0 
      traceroute to 10.91.2.0 (10.91.2.0), 30 hops max, 40 byte packets
       1  100.64.0.2  1.550 ms  0.500 ms  0.750 ms
       2  * * *
       3  10.91.2.0  3.245 ms  1.641 ms  1.691 ms
    
    
    
      rw@gw-01> ping count 3 10.201.0.197        
      PING 10.201.0.197 (10.201.0.197): 56 data bytes
      64 bytes from 10.201.0.197: icmp_seq=0 ttl=61 time=2.708 ms
      64 bytes from 10.201.0.197: icmp_seq=1 ttl=61 time=2.130 ms
      64 bytes from 10.201.0.197: icmp_seq=2 ttl=61 time=75.739 ms
    
      --- 10.201.0.197 ping statistics ---
      3 packets transmitted, 3 packets received, 0% packet loss
      round-trip min/avg/max/stddev = 2.130/26.859/75.739/34.564 ms
    
      rw@gw-01> traceroute no-resolve 10.201.0.197 
      traceroute to 10.201.0.197 (10.201.0.197), 30 hops max, 40 byte packets
       1  100.64.0.2  1.311 ms  0.520 ms  0.976 ms
       2  * * *
       3  10.201.0.197  2.755 ms  1.937 ms  1.835 ms
    
  • Verify the external connection to the containers hosted on node 2: ubuntu-3

      rw@gw-01> ping 10.91.1.128 
      PING 10.91.1.128 (10.91.1.128): 56 data bytes
      ^C
      --- 10.91.1.128 ping statistics ---
      4 packets transmitted, 0 packets received, 100% packet loss

      rw@gw-01> traceroute no-resolve 10.91.1.128 
      traceroute to 10.91.1.128 (10.91.1.128), 30 hops max, 40 byte packets
       1  100.64.0.2  333.443 ms  0.695 ms  0.418 ms
       2  * * *
       3  * * *
       4  * *^C

    • Hmm, the ping fails!

Why can’t the container on ubuntu-3 communicate with the external network?

Troubleshoot connectivity to container in node 2: ubuntu-3

  • Check the route on the MX gateway again.

      rw@gw-01> show route 10.91.1.128 table contrail-public.inet.0 detail 
    
      contrail-public.inet.0: 14 destinations, 14 routes (14 active, 0 holddown, 0 hidden)
      10.91.1.128/26 (1 entry, 1 announced)
              *BGP    Preference: 170/-101
                      Route Distinguisher: 192.168.1.18:1
                      Next hop type: Indirect
                      Address: 0x9985ea8
                      Next-hop reference count: 15
                      Source: 192.168.1.19
                      Next hop type: Router, Next hop index: 629
                      Next hop: via gr-0/0/10.32769, selected
                      Label operation: Push 51
                      Label TTL action: prop-ttl
                      Load balance label: Label 51: None; 
                      Session Id: 0x4
                      Protocol next hop: 192.168.1.18
                      Label operation: Push 51
                      Label TTL action: prop-ttl
                      Load balance label: Label 51: None; 
                      Indirect next hop: 0x9814440 1048578 INH Session ID: 0x5
                      State: <Secondary Active Int Ext ProtectionCand>
                      Local AS: 64512 Peer AS: 64512
                      Age: 1d 10:05:40 	Metric2: 0 
                      Validation State: unverified 
                      Task: BGP_64512.192.168.1.19+14024
                      Announcement bits (1): 1-KRT 
                      AS path: 64500 I
                      Communities: target:64512:1001 target:64512:8000001 unknown iana 30c unknown iana 30c unknown type 8004 value fc00:7a1201 unknown type 8071 value fc00:4
                      Import Accepted
                      VPN Label: 51
                      Localpref: 100
                      Router ID: 192.168.1.19
                      Primary Routing Table bgp.l3vpn.0
    
    
      rw@gw-01> show dynamic-tunnels database 
      Table: inet.3
    
      ..deleted..
    
      Destination-network: 192.168.1.0/24
      Tunnel to: 192.168.1.18/32 State: Up
        Reference count: 10
        Next-hop type: gre
          Source address: 192.168.1.22
          Next hop: gr-0/0/10.32769
          State: Up
    
    
    • From the output above, routing from the MX gateway to the Contrail compute node looks correct.
    • 10.91.1.128 points to next-hop interface gr-0/0/10.32769, which goes to 192.168.1.18.
      • 192.168.1.18 is the IP of the compute node where the k8s VM nodes are hosted.

Is it a routing problem?

  • Verify routing table from Contrail control node

    • Control node routing table to 10.91.1.128

      Contrail Control Node routing table

    • Control node routing table to 10.91.1.128 via ubuntu-4

      Contrail Control Node routing table to Node 1

    • Control node routing table to 10.91.1.128 via ubuntu-3

      Contrail Control Node routing table to Node 2

    • From the output above, the Contrail control node received the same routes from both k8s nodes, ubuntu-3 and ubuntu-4.

      • Although the container is hosted on ubuntu-3, ubuntu-4 also advertised the same route because there is full-mesh BGP peering between all the k8s nodes.
      • So, from the control node's point of view, everything looks OK.
  • Find out which control-node route is actually used by the data plane
    • In our setup we are somewhat lucky: both the ubuntu-3 and ubuntu-4 VMs are hosted on OpenStack compute-1 (192.168.1.18)
    • This means we can check the routing table on compute-1

      Contrail compute node routing table

    • OK, it looks like packets going to 10.91.1.128 are sent through the ubuntu-4 node.
  • Based on the result above, in theory, ubuntu-4 should forward the packets to ubuntu-3 via the IPIP tunnel between them.

  • Let’s verify that the ubuntu-3 node receives the packets going to container 10.91.1.128 via the IPIP tunnel

      ubuntu@ubuntu-3:~$ sudo tcpdump -n -i tunl0 icmp
      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
      15:41:38.483493 IP 100.64.0.1 > 10.91.1.128: ICMP echo request, id 62374, seq 13, length 64
      15:41:39.492471 IP 100.64.0.1 > 10.91.1.128: ICMP echo request, id 62374, seq 14, length 64
      15:41:40.502981 IP 100.64.0.1 > 10.91.1.128: ICMP echo request, id 62374, seq 15, length 64
      15:41:41.511815 IP 100.64.0.1 > 10.91.1.128: ICMP echo request, id 62374, seq 16, length 64
      15:41:42.522663 IP 100.64.0.1 > 10.91.1.128: ICMP echo request, id 62374, seq 17, length 64
      ^C
    
    • Yup, we can see the ping packets from the gateway router to container 10.91.1.128 arriving on the tunl0 interface.
  • Now, let’s see if the packets are actually forwarded to the veth interface between the ubuntu-3 node and the container

      ubuntu@ubuntu-3:~$ ip r show 10.91.1.128
      10.91.1.128 dev cali90490b35e30  scope link 
        
      ubuntu@ubuntu-3:~$ sudo tcpdump -n -i cali90490b35e30 icmp
      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on cali90490b35e30, link-type EN10MB (Ethernet), capture size 262144 bytes
      ^C
      0 packets captured
      0 packets received by filter
      0 packets dropped by kernel
    
    • Nope, nothing here. Something must be blocking it.
  • Verify that IP forwarding is enabled on the ubuntu-3 node

      ubuntu@ubuntu-3:~$ sudo sysctl -a | grep forward | grep ipv4 | grep tunl0  
      net.ipv4.conf.tunl0.forwarding = 1
      net.ipv4.conf.tunl0.mc_forwarding = 0
    
      ubuntu@ubuntu-3:~$ sudo sysctl -a | grep forward | grep ipv4 | grep cali90490b35e30
      net.ipv4.conf.cali90490b35e30.forwarding = 1
      net.ipv4.conf.cali90490b35e30.mc_forwarding = 0
    
    • Looks good. Nothing wrong.

Is it a firewall problem?

  • Maybe it is a firewall issue. Let’s check iptables

      ubuntu@ubuntu-3:~$ sudo iptables -L -n
    
    • Hmm, nothing obvious here.
  • Let’s check one more thing: the reverse path filtering settings.

      root@ubuntu-3:/home/ubuntu# sysctl -a | grep rp_filter
      net.ipv4.conf.all.arp_filter = 0
      net.ipv4.conf.all.rp_filter = 1
      net.ipv4.conf.cali06827901978.arp_filter = 0
      net.ipv4.conf.cali06827901978.rp_filter = 1
      net.ipv4.conf.cali90490b35e30.arp_filter = 0
      net.ipv4.conf.cali90490b35e30.rp_filter = 1
      net.ipv4.conf.caliaf4e899510c.arp_filter = 0
      net.ipv4.conf.caliaf4e899510c.rp_filter = 1
      net.ipv4.conf.default.arp_filter = 0
      net.ipv4.conf.default.rp_filter = 1
      net.ipv4.conf.docker0.arp_filter = 0
      net.ipv4.conf.docker0.rp_filter = 1
      net.ipv4.conf.ens3.arp_filter = 0
      net.ipv4.conf.ens3.rp_filter = 1
      net.ipv4.conf.ens4.arp_filter = 0
      net.ipv4.conf.ens4.rp_filter = 1
      net.ipv4.conf.lo.arp_filter = 0
      net.ipv4.conf.lo.rp_filter = 0
      net.ipv4.conf.tunl0.arp_filter = 0
      net.ipv4.conf.tunl0.rp_filter = 1
    
    • Hmm, this could be why: the ping path is asymmetric. With strict reverse path filtering (rp_filter = 1), the kernel drops packets that arrive on an interface other than the one it would use to route back to the source. Here, the requests arrive on tunl0, but the best route back to the gateway is via ens3.
      • incoming: MX gateway -> compute-1 -> ubuntu-4 node -(IPIP)-> ubuntu-3 node -> container 10.91.1.128
      • outgoing: container 10.91.1.128 -> ubuntu-3 node -> compute-1 -> MX gateway

Or is it something else?

  • Let’s disable reverse path filtering

      root@ubuntu-3:/home/ubuntu# sysctl -w net.ipv4.conf.tunl0.rp_filter=0
      net.ipv4.conf.tunl0.rp_filter = 0
      root@ubuntu-3:/home/ubuntu# sysctl -w net.ipv4.conf.ens3.rp_filter=0
      net.ipv4.conf.ens3.rp_filter = 0
      root@ubuntu-3:/home/ubuntu# 
    
    • Yep, that’s it!

    • Now ping works

        rw@gw-01> ping count 3 10.91.1.128 
        PING 10.91.1.128 (10.91.1.128): 56 data bytes
        64 bytes from 10.91.1.128: icmp_seq=0 ttl=61 time=4.976 ms
        64 bytes from 10.91.1.128: icmp_seq=1 ttl=61 time=2.262 ms
        64 bytes from 10.91.1.128: icmp_seq=2 ttl=61 time=3.298 ms
      
        --- 10.91.1.128 ping statistics ---
        3 packets transmitted, 3 packets received, 0% packet loss
        round-trip min/avg/max/stddev = 2.262/3.512/4.976/1.118 ms
      
        rw@gw-01> traceroute no-resolve 10.91.1.128
        traceroute to 10.91.1.128 (10.91.1.128), 30 hops max, 40 byte packets
         1  100.64.0.2  0.831 ms  0.452 ms  1.320 ms
         2  * * *
         3  100.64.1.24  3.203 ms  2.408 ms  2.353 ms
         4  10.91.1.128  2.561 ms  2.138 ms  1.882 ms
      
        rw@gw-01>
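One closing note: `sysctl -w` changes do not survive a reboot. Below is a hedged sketch of making the fix persistent (the file name is my choice, not anything Calico creates). Loose-mode filtering, value 2, is worth considering instead of 0 if you want to keep some spoofing protection while still tolerating the asymmetric path:

```
# Hypothetical /etc/sysctl.d/99-rpfilter.conf
# rp_filter values: 0 = no source validation, 1 = strict (RFC 3704),
# 2 = loose (accept if the source is reachable via any interface).
net.ipv4.conf.tunl0.rp_filter = 0
net.ipv4.conf.ens3.rp_filter = 0
```

Apply it with `sudo sysctl --system` (or a reboot).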