Author: Kyle Bassett


Recently I was building a multi-node OpenShift 3.4 environment in Softlayer - AKA - Bluemix Infrastructure. I ran into an issue with the installer (running via Ansible) that was not very easy to troubleshoot and get to the the root cause. This is a ~quick~ blog to ideally save someone else in a similar position a lot of time, google searching and chasing their tail.

This issue could easily come up in on-prem installations depending on the vm or server configuration. A multi-master setup utilizes etcd and this is where the core issue came up, basically during the install etcd tries to communicate with the other etcd nodes via the API and it times out and fails the install (connection refused). Jumping to the root cause was the fact that my vm’s were configured with 2 interfaces as follows:

**1 for internet access (inbound & out) - 169.55.187.**/28**

**1 for internal access - 10.166.21.128/26**

Most Cloud providers I have used don’t use this type of configuration, they simple give you an internal vlan and have a NAT service to allow internet connectivity inbound and outbound. Int his case we need both as we need to get packages from the internet also and allow access inbound on a few ports.

If you end up with a similar configuration and run into the issue, you will likely see something like this in your ansible log upon failure:

*Sorry for this length of this blog up-front but I am hoping this helps someone who is googleing error messages they find - like I was…

skipping: [ocpmaster2.arctiqlab.local] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}
skipping: [ocpmaster3.arctiqlab.local] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}
FAILED - RETRYING: TASK: openshift_master : Start and enable master api on first master (1 retries left).Result was: {
    "attempts": 1,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": true,
            "masked": null,
            "name": "atomic-openshift-master-api",
            "state": "started",
            "user": false
        },
        "module_name": "systemd"
    },
    "retries": 2
}

MSG:

Unable to start service atomic-openshift-master-api: Job for atomic-openshift-master-api.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master-api.service" and "journalctl -xe" for details.


Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/systemd.py
<ocpmaster1.arctiqlab.local> ESTABLISH SSH CONNECTION FOR USER: root
<ocpmaster1.arctiqlab.local> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r ocpmaster1.arctiqlab.local '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306 `" && echo ansible-tmp-1488257065.33-84742852428306="` echo $HOME/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306 `" ) && sleep 0'"'"''
<ocpmaster1.arctiqlab.local> PUT /tmp/tmp6fTDDM TO /root/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306/systemd.py
<ocpmaster1.arctiqlab.local> SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r '[ocpmaster1.arctiqlab.local]'
<ocpmaster1.arctiqlab.local> ESTABLISH SSH CONNECTION FOR USER: root
<ocpmaster1.arctiqlab.local> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r ocpmaster1.arctiqlab.local '/bin/sh -c '"'"'chmod u+x /root/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306/ /root/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306/systemd.py && sleep 0'"'"''
<ocpmaster1.arctiqlab.local> ESTABLISH SSH CONNECTION FOR USER: root
<ocpmaster1.arctiqlab.local> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r -tt ocpmaster1.arctiqlab.local '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306/systemd.py; rm -rf "/root/.ansible/tmp/ansible-tmp-1488257065.33-84742852428306/" > /dev/null 2>&1 && sleep 0'"'"''
fatal: [ocpmaster1.arctiqlab.local]: FAILED! => {
    "attempts": 1,
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": true,
            "masked": null,
            "name": "atomic-openshift-master-api",
            "state": "started",
            "user": false
        },
        "module_name": "systemd"
    }
}

MSG:

Unable to start service atomic-openshift-master-api: Job for atomic-openshift-master-api.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master-api.service" and "journalctl -xe" for details.



NO MORE HOSTS LEFT *************************************************************
    to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost                  : ok=11   changed=0    unreachable=0    failed=0
ocpapp1.arctiqlab.local      : ok=45   changed=0    unreachable=0    failed=0
ocpapp2.arctiqlab.local      : ok=45   changed=0    unreachable=0    failed=0
ocpbindnfs1.arctiqlab.local  : ok=63   changed=2    unreachable=0    failed=0
ocpinfra1.arctiqlab.local    : ok=45   changed=0    unreachable=0    failed=0
ocpinfra2.arctiqlab.local    : ok=45   changed=0    unreachable=0    failed=0
ocpmaster1.arctiqlab.local   : ok=256  changed=61   unreachable=0    failed=1
ocpmaster2.arctiqlab.local   : ok=233  changed=66   unreachable=0    failed=0
ocpmaster3.arctiqlab.local   : ok=233  changed=64   unreachable=0    failed=0

Now next step will be dig deeper by getting on one of the masters and looking at journalctl -xe you will see some additional information (maybe something like this):

connection refused was the tip…

Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.585254   29644 reflector.go:203] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/limitranger/admission.go:154: Failed to l
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.585310   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed t
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.585382   29644 reflector.go:203] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/limitranger/admission.go:154: Failed to l
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.585441   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:74: Fai
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.587003   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Fail
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.592799   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.PolicyBinding: client: etcd cluster is u
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.592880   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.ClusterPolicyBinding: client: etcd clust
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.594217   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.Policy: client: etcd cluster is unavaila
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.594281   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.ClusterPolicy: client: etcd cluster is u
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.597702   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.Group: client: etcd cluster is unavailab
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.603488   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.User: client: etcd cluster is unavailabl
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:23.603542   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.OAuthAccessToken: client: etcd cluster i
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:23 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.586531   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:74: Fai
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.586526   29644 reflector.go:203] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/limitranger/admission.go:154: Failed to l
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.586634   29644 reflector.go:203] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/limitranger/admission.go:154: Failed to l
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.586647   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:103: Failed t
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.586713   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed t
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.588378   29644 reflector.go:214] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Fail
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.596734   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.PolicyBinding: client: etcd cluster is u
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.596800   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.ClusterPolicyBinding: client: etcd clust
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.598776   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.Policy: client: etcd cluster is unavaila
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.599091   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.ClusterPolicy: client: etcd cluster is u
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.601257   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.Group: client: etcd cluster is unavailab
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.138:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.606686   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.OAuthAccessToken: client: etcd cluster i
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: E0227 23:56:24.606732   29644 cacher.go:254] unexpected ListAndWatch error: pkg/storage/cacher.go:194: Failed to list *api.User: client: etcd cluster is unavailabl
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #1: dial tcp 10.166.21.132:2379: getsockopt: connection refused
Feb 27 23:56:24 ocpmaster1.arctiqlab.local atomic-openshift-master-api[29644]: ; error #2: dial tcp 10.166.21.186:2379: getsockopt: connection refused
lines 1009-1061/1061 (END)

So some more google searching will prob lead you to one of these bugzilla’s but they don’t provide a solution or a work around - remember all you want to do in have your OCP install go well. (I will add a link to this blog also)

https://bugzilla.redhat.com/show_bug.cgi?id=1393187

https://bugzilla.redhat.com/show_bug.cgi?id=1375111 - we have a work around for this one

So how https://bugzilla.redhat.com/show_bug.cgi?id=1375111 did I troubleshoot this? As you can see the error logs is showing internal 10. ip’s. When you are installing OpenShift you have this nice handy facts playbook, it basically gathers all the facts about your setup. This is what Ansible will use when actually doing the install so the information here is very important.

ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift_facts.yml

Here it the output below that directed me to the root cause - See the 169.55.191. IP ADDRESS?** Well this is the external (public ip), I don’t want it using that IP to communicate across etcd nodes. It will fail as I don’t open all those ports on the internet side of things. This odd part is that I use an internal BIND server for all my communication and the FQDN names use all the internal 10. blocks. My entire Ansible host file is based on FQDN names. In the end I don’t want anything to do with the public address - but how do I get around this? I could not find a solution even with all my google searching. Lots of attempts failed…

TASK [debug] *******************************************************************
ok: [ocpmaster1.arctiqlab.local] => {
    "result": {
        "ansible_facts": {
            "openshift": {
                "common": {
                    "admin_binary": "oadm",
                    "all_hostnames": [
                        "169.55.191.**",
                        "ocpmaster1.arctiqlab.local"

Small Segway… One great thing about having an amazing team at Arctiq is we can collaborate on things that stump us. So I send a Slack message to Shea. Quickly we figure out we can’t solve this over chat, so I call him… explain the issues, show him what I am seeing… and in the end we found a ‘what I call’ an ‘undocumented’ work around.

**Yah collaboration / team!!! **

So in the end we used a working around in the Ansible host file, if you are installing OpenShift you will know this file well. It generally lives in etc/ansible/hosts (we recommend not running it from here but that is another blog).

SO HERE IS THE WORKAROUND -> Its a super simple workaround but again it took me a long time to figure it out and the errors can send you chasing other “non” issues… Drum roll …

#Multi Master Configuration
[etcd]
ocpmaster1.arctiqlab.local openshift_ip=10.166.21.138
ocpmaster2.arctiqlab.local openshift_ip=10.166.21.186
ocpmaster3.arctiqlab.local openshift_ip=10.166.21.132

Just add openshift_ip=<ip address> to all references to hostnames in your host file. When Ansible gathers facts it will force it to use the IP you define rather that it grabbing the public IP you want nothing to do with… After I made this change, I ran the clean-up / uninstall scripts and ran my install again and all went fine.

Lesson learned and I hope it helps someone a lot of time, effort and stress some day.

Now remember to “Pay it Forward” and write a blog when you solve a weird technical issue.

Tagged:



//comments