ClusterShell: parallel SSH on many hosts

How do you gather uptime information from a large number of remote hosts? Open a bunch of terminals and paste the command into each of them? Loop over the hosts with a shell script? Thankfully, there is a better way.

Introduction

I have learned a lot during the Locked Shields exercise. One of the key takeaways for me was the importance of quickly running ad-hoc commands on the many machines I administered. Ansible wasn’t quite suitable for this task, and I refused to write another bad shell script. The solution: ClusterShell.

ClusterShell (or clush) is a CLI tool that lets us run shell commands on many hosts at once and copy files to and from them. Outputs can be saved to a directory for later analysis. In this article, I go through the configuration, common use cases, and a few gotchas to be aware of.

Configuring the hosts

Host file

The simplest method of specifying hosts to run on is creating a simple text file:

# example.txt
host01.example.com
host02.example.com
host03.example.com

We use the --hostfile flag to target all hosts in the file. In the following example, we run the command id on the hosts specified in the file:

clush --hostfile example.txt id
# host01.example.com: uid=1000(user) gid=1000(user) groups=1000(user),27(sudo)
# host02.example.com: uid=1000(user) gid=1000(user) groups=1000(user),27(sudo)
# host03.example.com: uid=1000(user) gid=1000(user) groups=1000(user),27(sudo)

This is simple, but it gets annoying to specify the host file over and over again, and we also miss out on advanced features such as groups and patterns. More on that later.

YAML files

Let’s create a ClusterShell groups configuration file in our local config directory:

# ~/.config/clustershell/groups.conf

[Main]
autodir: $CFGDIR/groups
default: staticyaml

The Main section contains two parameters:

  • autodir - directory with YAML files containing sources, groups, and hosts
  • default - the default source

As you can see above, the default source is staticyaml, and the groups directory is ~/.config/clustershell/groups/. Let’s now create the YAML file inside of it and define a few hosts:

# ~/.config/clustershell/groups/hosts.yml
staticyaml:
  example:
    - host01.example.com
    - host02.example.com
    - host03.example.com
  homelab:
    - homelab01.localdomain
    - homelab02.localdomain

The file contains one source called staticyaml and two groups, example and homelab, each with several hosts. We can now run clush like this:

# run `uptime` on all hosts from source staticyaml
clush -a -s staticyaml uptime

# staticyaml is the default source, so we don't have to specify it
# explicitly
clush -a uptime

# run `uptime` on homelab group
clush -g homelab uptime

# run `uptime` on hosts matching the pattern 
clush -w 'host[01-02]*' uptime
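
Groups can also be referenced with the @ syntax inside -w, which lets us mix groups with plain host names and patterns. The group names below are the ones defined in the YAML file above:

# run `uptime` on the homelab group using the @ syntax
clush -w @homelab uptime

# mix a group with an explicit host
clush -w '@example,homelab01.localdomain' uptime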

External commands

The third and most powerful way is specifying external commands that return the hosts. We can, for example, use this to derive the hosts from an Ansible inventory or an SSH configuration.

Let’s extend the groups.conf configuration:

# ~/.config/clustershell/groups.conf

[Main]
autodir: $CFGDIR/groups
default: staticyaml

[ssh]
map: grep -Po "(?<=^Host).*$" ~/.ssh/config | tr -d ' '
all: grep -Po "(?<=^Host).*$" ~/.ssh/config | tr -d ' '

[ls14]
map: $CFGDIR/ansible_hosts.py --cwd /home/me/code/ls14/ansible --group $GROUP
all: $CFGDIR/ansible_hosts.py --cwd /home/me/code/ls14/ansible
list: $CFGDIR/ansible_hosts.py --cwd /home/me/code/ls14/ansible --list

We have now configured two more sources called ssh and ls14. Each source can specify the following parameters:

  • map returns hosts that belong to a particular group (required)
  • all returns all hosts (optional)
  • list returns all groups (optional)
Note

The current working directory of map, all, and list commands is the configuration directory, not the directory from which you start clush. Unfortunately, that’s the way clush works.

The ssh source includes all hosts defined in your ~/.ssh/config file. This only works if your SSH config doesn’t include any wildcards or patterns. The ls14 source specifies a Python script that returns hosts and groups based on an Ansible inventory.
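
The Python script itself isn’t shown here, but to give an idea of what such a helper does, here is a rough shell sketch built on ansible-inventory and jq (both assumed to be installed); the flag names simply mirror the ones used in groups.conf above:

#!/usr/bin/env bash
# ansible_hosts.sh - illustrative sketch of an external group source helper
set -euo pipefail

cwd="" group="" list=0
while [ $# -gt 0 ]; do
  case "$1" in
    --cwd)   cwd="$2"; shift 2 ;;
    --group) group="$2"; shift 2 ;;
    --list)  list=1; shift ;;
    *)       shift ;;
  esac
done

if [ -n "$cwd" ]; then cd "$cwd"; fi
inventory=$(ansible-inventory --list)

if [ "$list" -eq 1 ]; then
  # all group names, excluding the meta entries
  echo "$inventory" | jq -r 'keys[] | select(. != "_meta" and . != "all")'
elif [ -n "$group" ]; then
  # hosts directly in the given group (child groups are not resolved here)
  echo "$inventory" | jq -r --arg g "$group" '.[$g].hosts[]?'
else
  # every host known to the inventory
  echo "$inventory" | jq -r '._meta.hostvars | keys[]'
fi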

Now we have three sources in total: staticyaml which we created in the beginning, and ssh and ls14, which we have added just now. Remember, we have configured staticyaml to be the default, so we now have to specify the other sources explicitly:

# run `uptime` on all hosts in ~/.ssh/config
clush -s ssh -a uptime

# run `uname -r` on all hosts defined in the ansible inventory
clush -s ls14 -a uname -r

# run `uname -r` on the group private defined in the ansible inventory
clush -s ls14 -g private uname -r

That’s about it for configuring ClusterShell. Let’s answer a few remaining questions.

Q&A

Are the SSH sessions persistent?

No, a new session is started for each command, even in interactive mode. Notice that, for example, changing directories does not persist:

clush -w homelab02.localdomain
Enter 'quit' to leave this interactive mode
Working with nodes: homelab02.localdomain
clush> pwd
homelab02.localdomain: /home/user
clush> cd /
clush> pwd
homelab02.localdomain: /home/user

Instead, we have to do this:

clush -w homelab02.localdomain
Enter 'quit' to leave this interactive mode
Working with nodes: homelab02.localdomain
clush> cd / && pwd
homelab02.localdomain: /

Another option is passing bash scripts.

How to run bash scripts?

cat script.sh | clush -w homelab02.localdomain
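
The contents of script.sh aren’t shown here; judging from the output in the next answer, it could be something along these lines (purely illustrative):

# script.sh (illustrative)
pwd
uname -s
echo "This goes to stderr" >&2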

How to capture output?

Watch out

The files are overwritten each time the command is run. Output is not appended.

cat script.sh | clush --outdir outdir --stderr errdir -w 'homelab[01-02].localdomain' bash
# homelab01.localdomain: /home/user
# homelab01.localdomain: Linux
# homelab01.localdomain: This goes to stderr
# homelab02.localdomain: /home/user
# homelab02.localdomain: Linux
# homelab02.localdomain: This goes to stderr

tree outdir errdir
# errdir
# ├── homelab01.localdomain
# └── homelab02.localdomain 
# outdir
# ├── homelab01.localdomain 
# └── homelab02.localdomain 

cat errdir/homelab02.localdomain
# This goes to stderr
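
If you only need to read the results on screen rather than store them, the -b flag is also handy: it gathers the output and merges hosts that returned identical results. The output below is illustrative:

clush -b -w 'homelab[01-02].localdomain' uname -s
# ---------------
# homelab[01-02].localdomain (2)
# ---------------
# Linux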

How to copy files?

# upload files
clush -w 'homelab[01-02].localdomain' --copy .zshrc .dotfiles/

# upload files to specified directory
clush -w 'homelab[01-02].localdomain' --copy .zshrc .dotfiles/ --dest /tmp/upload

# download files
mkdir download
clush -w 'homelab[01-02].localdomain' --rcopy .bash_history --dest download

tree download
# download
# ├── .bash_history.homelab01.localdomain
# └── .bash_history.homelab02.localdomain

How about sudo?

If you cannot SSH in as root and need sudo for privileged tasks, create a new file clush.conf in your ClusterShell configuration directory and add the following lines. Here, we define a mode and set a command prefix for all commands run via clush:

# ~/.config/clustershell/clush.conf

[mode:sudo]
password_prompt: yes
command_prefix: /usr/bin/sudo -S -p "''"

You can now run clush with the -m sudo flag. All hosts must share the same sudo password, which is far from ideal. It is more secure to allow root to SSH in with an SSH key backed by FIDO2 or a TPM.
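
With the mode defined, a privileged run might look like this (reusing the homelab group from earlier); clush prompts for the sudo password once and passes it to sudo -S on every host:

clush -m sudo -g homelab id
# homelab01.localdomain: uid=0(root) gid=0(root) groups=0(root)
# homelab02.localdomain: uid=0(root) gid=0(root) groups=0(root)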

Various SSH keys, ports, hostnames, options, etc.

Use your SSH config to set these. There is a tool, ansible-inventory-to-ssh-config, that can help with that.
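
Since clush runs regular ssh under the hood, anything set in ~/.ssh/config applies. For example, a host with a non-standard address, port, user, and key only needs to be described once (the values below are made up):

# ~/.ssh/config
Host homelab02.localdomain
    HostName 192.0.2.12
    Port 2222
    User admin
    IdentityFile ~/.ssh/id_ed25519_homelab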

How about SSH host keys?

SSH host keys authenticate the server to the client. Before the first connection, the user must verify the host key interactively. Here’s a simple bash script that fetches and stores the host keys for a list of domains read from standard input, skipping any that are already in known_hosts:

# read one domain per line from standard input
while read domain; do
  # skip hosts whose key is already present in known_hosts
  if ! ssh-keygen -F "$domain" -f ~/.ssh/known_hosts | grep -q found; then
    ssh-keyscan -H "$domain" >> ~/.ssh/known_hosts
  fi
done

The script does not work for IP addresses and non-default ports. For these use cases, the command would look something like this:

if ! ssh-keygen -F '[127.0.0.1]:2222' -f ~/.ssh/known_hosts | grep -q found; then
  ssh-keyscan -p 2222 -H '127.0.0.1' >> ~/.ssh/known_hosts
fi
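
To feed the host-key loop the same hosts that clush targets, one option is to expand a group with the cluset utility that ships with ClusterShell (called nodeset in older releases) and pipe the result in:

# expand the homelab group to one host per line and scan any missing keys
cluset -e @homelab | tr ' ' '\n' |
while read domain; do
  if ! ssh-keygen -F "$domain" -f ~/.ssh/known_hosts | grep -q found; then
    ssh-keyscan -H "$domain" >> ~/.ssh/known_hosts
  fi
done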