Ezjail host

From TykWiki

Background

This is my personal checklist for when I am setting up a new ezjail host. I like my jail hosts configured in a very specific way. There is a good chance that what is right for me is not right for you. As always, YMMV.

Also note that I talk a lot about the German hosting provider Hetzner. If you are using another provider, or doing this at home, just ignore the Hetzner-specific stuff. Much of the content here can be used with little or no change outside Hetzner.

Installation

OS install with mfsbsd

After receiving the server from Hetzner I boot it using the rescue system which puts me at an mfsbsd prompt via SSH. This is perfect for installing a zfs-only server.

Changes to zfsinstall

I edit the zfsinstall script /root/bin/zfsinstall and add "usr" to FS_LIST near the top of the script. I do this because I like to have /usr as a separate ZFS dataset.

Check disks

I create a small zpool of just 30 GB, enough to comfortably hold the base OS and so on. The rest of the disk space will be used for GELI, which will have the other ZFS pool on top. This encrypted zpool will house the actual jails and data. This setup allows me to have all the important data encrypted, while still letting the physical server boot without the human intervention that full disk encryption would require.

Note that the disks in this server are not new; they have been used for around two years (18023 hours / 24 ≈ 750 days):

[root@rescue ~]# grep "ada[0-9]:" /var/run/dmesg.boot | grep "MB "
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
[root@rescue ~]# smartctl -a /dev/ada0 | grep Power_On_Hours
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18023
[root@rescue ~]# smartctl -a /dev/ada1 | grep Power_On_Hours
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18023
[root@rescue ~]#
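The age estimate is simple arithmetic; a trivial shell sketch using the hour count from the smartctl output above:

```shell
# Sketch: turn SMART Power_On_Hours into an age estimate
hours=18023                   # value from the smartctl output above
days=$((hours / 24))
years=$((days / 365))
echo "${days} days (~${years} years)"   # prints: 750 days (~2 years)
```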

Destroy existing partitions

Any existing partitions need to be deleted first. This can be done with the destroygeom command as shown below:

[root@rescue ~]# destroygeom -d ada0 -d ada1
Destroying geom ada0:
    Deleting partition 3 ... done
Destroying geom ada1:
    Deleting partition 1 ... done
    Deleting partition 2 ... done
    Deleting partition 3 ... done

Install FreeBSD

Installing FreeBSD with mfsbsd is easy. I run the following command, adjusting the release I want to install:

[root@rescue ~]# zfsinstall -d ada0 -d ada1 -r mirror -z 30G -t /nfs/mfsbsd/10.0-release-amd64.tbz
Creating GUID partitions on ada0 ... done
Configuring ZFS bootcode on ada0 ... done
=>        34  3907029101  ada0  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048         128     1  freebsd-boot  (64k)
        2176    62914560     2  freebsd-zfs  (30G)
    62916736  3844112399        - free -  (1.8T)

Creating GUID partitions on ada1 ... done
Configuring ZFS bootcode on ada1 ... done
=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048         128     1  freebsd-boot  (64k)
        2176    62914560     2  freebsd-zfs  (30G)
    62916736  3844112399        - free -  (1.8T)

Creating ZFS pool tank on ada0p2 ada1p2 ... done
Creating tank root partition: ... done
Creating tank partitions: var tmp usr ... done
Setting bootfs for tank to tank/root ... done
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank            270K  29.3G    31K  none
tank/root       127K  29.3G    34K  /mnt
tank/root/tmp    31K  29.3G    31K  /mnt/tmp
tank/root/usr    31K  29.3G    31K  /mnt/usr
tank/root/var    31K  29.3G    31K  /mnt/var
Extracting FreeBSD distribution ... done
Writing /boot/loader.conf... done
Writing /etc/fstab...Writing /etc/rc.conf... done
Copying /boot/zfs/zpool.cache ... done

Installation complete.
The system will boot from ZFS with clean install on next reboot

You may type "chroot /mnt" and make any adjustments you need.
For example, change the root password or edit/create /etc/rc.conf for
for system services.

WARNING - Don't export ZFS pool "tank"!
[root@rescue ~]#
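The sizes in the gpart output above are given in 512-byte sectors; as a quick sanity check, the freebsd-zfs partition's sector count can be converted back to the 30 G I asked for:

```shell
# Sanity check: 62914560 sectors of 512 bytes each should equal the requested 30 GiB
sectors=62914560                          # from the gpart output above
bytes=$((sectors * 512))
gib=$((bytes / 1024 / 1024 / 1024))
echo "${gib} GiB"                         # prints: 30 GiB
```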

Post install configuration (before reboot)

Before rebooting into the installed FreeBSD I need to make certain I can reach the server through SSH after the reboot. This means:

  1. Adding network settings to /etc/rc.conf
  2. Adding sshd_enable="YES" to /etc/rc.conf
  3. Change PermitRootLogin to yes in /etc/ssh/sshd_config. Note: this is now the default in the current image that Hetzner provides
  4. Add nameservers to /etc/resolv.conf
  5. Finally I set the root password.

All of these steps are essential if I am going to have any chance of logging in after reboot. Most of these changes can be done from the mfsbsd shell but the password change requires chroot into the newly installed environment.

I use the chroot command but start csh, as bash is not installed in /mnt:

[root@rescue ~]# chroot /mnt/ csh
rescue# ee /etc/rc.conf
rescue# ee /etc/ssh/sshd_config
rescue# passwd
New Password:
Retype New Password:
rescue#

So, the network settings are sorted, root password is set, and root is permitted to ssh in. Time to reboot (this is the exciting part).

Remember to use shutdown -r now and not reboot when you reboot. shutdown -r now performs the proper shutdown process including rc.d scripts and disk buffer flushing. reboot is the "bigger hammer" to use when something is preventing shutdown -r now from working.

Basic config after first boot

If the server boots without any problems, I do some basic configuration before I continue with the disk partitioning.

Timezone

I run the command tzsetup to set the proper timezone, and set the time using ntpdate if necessary.

Note: The current Hetzner FreeBSD image has the timezone set to CEST; I like my servers configured as UTC.

Basic ports

I also add some basic packages with pkg so I can get screen etc. up and running as soon as possible:

# pkg install bash screen sudo portmaster

I then add the following to /usr/local/etc/portmaster.rc:

ALWAYS_SCRUB_DISTFILES=dopt
PM_DEL_BUILD_ONLY=pm_dbo
SAVE_SHARED=wopt
PM_LOG=/var/log/portmaster.log
PM_IGNORE_FAILED_BACKUP_PACKAGE=pm_ignore_failed_backup_package

An explanation of these options can be found on the Portmaster page.

After a rehash and adding my non-root user with adduser, I am ready to continue with the disk configuration. I also remember to disable root login in /etc/ssh/sshd_config.

Further disk configuration

After the reboot into the installed FreeBSD environment, I need to do some further disk configuration.

Create swap partitions

Swap-on-ZFS is not a good idea: ZFS itself needs memory to complete writes, which can deadlock a machine that is swapping precisely because it is out of memory. To keep my swap encrypted but still off ZFS I use geli onetime encryption. To avoid problems if a disk dies I also use gmirror. First I add the partitions with gpart:

$ sudo gpart add -t freebsd-swap -s 10G /dev/ada0
ada0p3 added
$ sudo gpart add -t freebsd-swap -s 10G /dev/ada1
ada1p3 added
$

Then I make sure gmirror is loaded, and loaded on boot:

$ sudo sysrc -f /boot/loader.conf geom_mirror_load="YES"
$ sudo kldload geom_mirror

Then I create the gmirror:

sudo gmirror label swapmirror /dev/ada0p3 /dev/ada1p3

Finally I add the following line to /etc/fstab to get encrypted swap on top of the gmirror:

/dev/mirror/swapmirror.eli  none    swap    sw,keylen=256,sectorsize=4096     0       0

I can enable the new swap partition right away:

$ sudo swapon /dev/mirror/swapmirror.eli
$ swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/mirror/swapmirror.eli   8388604        0  8388604     0%
$

Create GELI partitions

First I create the partitions to hold the geli devices:

$ sudo gpart add -t freebsd-ufs ada0
ada0p4 added
$ sudo gpart add -t freebsd-ufs ada1
ada1p4 added

I add them as freebsd-ufs type partitions, as there is no dedicated freebsd-geli type.

Create GELI key

To create a GELI key I copy some data from /dev/random:

$ sudo dd if=/dev/random of=/root/geli.key bs=256k count=1
1+0 records in
1+0 records out
262144 bytes transferred in 0.003347 secs (78318372 bytes/sec)
$
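A paranoid habit is to verify the key's size and lock down its permissions right after generating it. A sketch, using /dev/urandom and a temporary path purely for illustration (the real key lives in /root/geli.key and is generated as shown above):

```shell
# Generate a 256 KiB key file and verify it (temporary path for illustration only)
key=$(mktemp)
dd if=/dev/urandom of="$key" bs=256k count=1 2>/dev/null
chmod 600 "$key"       # the key must not be readable by anyone but root
wc -c < "$key"         # expect 262144 bytes (1 x 256 KiB)
rm "$key"
```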

Create GELI volumes

I create the GELI volumes with a 4k sector size and a 256-bit AES key:

$ sudo geli init -s 4096 -K /root/geli.key -l 256 /dev/ada0p4
Enter new passphrase:
Reenter new passphrase:

Metadata backup can be found in /var/backups/ada0p4.eli and
can be restored with the following command:

        # geli restore /var/backups/ada0p4.eli /dev/ada0p4

$ sudo geli init -s 4096 -K /root/geli.key -l 256 /dev/ada1p4
Enter new passphrase:
Reenter new passphrase:

Metadata backup can be found in /var/backups/ada1p4.eli and
can be restored with the following command:

        # geli restore /var/backups/ada1p4.eli /dev/ada1p4

$

Enable AESNI

Most Intel CPUs have hardware acceleration of AES which helps a lot with GELI performance. I load the aesni module during boot from /boot/loader.conf:

$ sudo sysrc -f /boot/loader.conf aesni_load="YES"

Attach GELI volumes

Now I just need to attach the GELI volumes before I am ready to create the second zpool:

$ sudo geli attach -k /root/geli.key /dev/ada0p4
Enter passphrase:
$ sudo geli attach -k /root/geli.key /dev/ada1p4
Enter passphrase:
$

Create second zpool

$ sudo zpool create gelipool mirror /dev/ada0p4.eli /dev/ada1p4.eli
$ zpool status
  pool: gelipool
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        gelipool        ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada0p4.eli  ONLINE       0     0     0
            ada1p4.eli  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors
$

Create ZFS filesystems on the new zpool

The last remaining thing is to create a filesystem in the new zfs pool:

$ zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
gelipool        624K  3.54T   144K  /gelipool
tank            704M  28.6G    31K  none
tank/root       704M  28.6G   413M  /
tank/root/tmp    38K  28.6G    38K  /tmp
tank/root/usr   291M  28.6G   291M  /usr
tank/root/var   505K  28.6G   505K  /var
$ sudo zfs set mountpoint=none gelipool
$ sudo zfs set compression=on gelipool
$ sudo zfs create -o mountpoint=/usr/jails gelipool/jails
$ zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
gelipool         732K  3.54T   144K  none
gelipool/jails   144K  3.54T   144K  /usr/jails
tank             704M  28.6G    31K  none
tank/root        704M  28.6G   413M  /
tank/root/tmp     38K  28.6G    38K  /tmp
tank/root/usr    291M  28.6G   291M  /usr
tank/root/var    505K  28.6G   505K  /var
$

Disable atime

One last thing I like to do is to disable atime, or access time, on the filesystems. Access times are recorded every time a file is read, and while this can have its use cases, I never use it. Disabling it means a lot fewer write operations, as a read operation no longer automatically includes a write operation. Disabling it is easy:

$ sudo zfs set atime=off tank
$ sudo zfs set atime=off gelipool
$

The next things are post-install configuration stuff like OS upgrade, ports, firewall and so on. The basic install is finished \o/

Reserved space

Running out of space in ZFS is bad. Stuff will run slowly and may stop working entirely until some space is freed. The problem is that ZFS is a copy-on-write filesystem, which means that every operation, even a deletion, requires writing data to the disk. I've more than once wound up in a situation where I couldn't delete files to free up disk space because the disk was full.

Sometimes this can be resolved by overwriting a large file, like:

$ echo > /path/to/a/large/file

This truncates the file, thereby freeing up some space, but sometimes even this is not possible. This is where reserved space comes in. I create a new filesystem in each pool, set them readonly and without a mountpoint, and with 1G reserved each:
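The truncation effect is easy to demonstrate on a throwaway file (temp path for illustration only); note that echo leaves a single newline behind rather than zero bytes:

```shell
# Demonstrate freeing a file's blocks by truncating it in place
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null   # a 1 MiB "large file"
echo > "$f"            # same trick as above; only a newline remains
wc -c < "$f"           # expect 1 byte
rm "$f"
```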

$ sudo zfs create -o mountpoint=none -o reservation=1G -o readonly=on gelipool/reserved
$ sudo zfs create -o mountpoint=none -o reservation=1G -o readonly=on tank/reserved

If I run out of space for some reason, I can just delete the dataset, or unset the reserved property, and I immediately have 1G diskspace available. Yay!

Ports

Installing the ports tree

I need to bootstrap the ports system. I use portsnap, as it is much faster than c(v)sup. Initially I run portsnap fetch extract, and when I need to update the tree later I use portsnap fetch update.

smartd

I install smartd to monitor the disks for problems:

$ sudo pkg install smartmontools

I create the file /usr/local/etc/smartd.conf and add this line to it:

DEVICESCAN -a -m thomas@gibfest.dk

This makes smartd monitor all disks and send me an email if it finds an error.

Remember to enable smartd in /etc/rc.conf and start it:

sudo sysrc smartd_enable="YES"
sudo service smartd start

openntpd

I install net/openntpd to keep the clock in sync. I find this a lot easier to configure than the base ntpd.

sudo pkg install openntpd

I enable openntpd in /etc/rc.conf:

sudo sysrc openntpd_enable="YES"

and add a one line config file:

$ grep -v "^#" /usr/local/etc/ntpd.conf | grep -v "^$"
servers de.pool.ntp.org
$

sync the clock and start openntpd:

sudo ntpdate de.pool.ntp.org
sudo service openntpd start

ntpdate

I also enable ntpdate to help set the clock after a reboot. I add the following two lines to /etc/rc.conf:

sudo sysrc ntpdate_enable="YES"
sudo sysrc ntpdate_hosts="de.pool.ntp.org"

Upgrade OS (buildworld)

I usually run -STABLE on my hosts, which means I need to build and install a new world and kernel. I like having rctl available on my jail hosts, so I can limit jail resources in all kinds of neat ways, and I like having DTrace available. Additionally, I need the built world to populate ezjail's basejail.

Note: I will need to update the host and the jails many times during the lifespan of this server, which is likely > 2-3 years. As new security problems are found or features are added that I want, I will update host and jails. There is a section about staying up to date later in this page. This section (the one you are reading now) only covers the OS update I run right after installing the server.

Fetching sources

First I get the sources:

[tykling@jail1 ~]$ sudo git clone -b stable/13 https://git.freebsd.org/src.git /usr/src
Password:
Cloning into '/usr/src'...
remote: Enumerating objects: 4060913, done.
remote: Counting objects: 100% (379329/379329), done.
remote: Compressing objects: 100% (27474/27474), done.
remote: Total 4060913 (delta 373583), reused 351855 (delta 351855), pack-reused 3681584
Receiving objects: 100% (4060913/4060913), 1.38 GiB | 5.94 MiB/s, done.
Resolving deltas: 100% (3217803/3217803), done.
Updating files: 100% (86931/86931), done.
[tykling@jail1 ~]$

This takes a while the first time, but subsequent git pull runs are much faster.

Note: I used to use svn or svnlite here, but since the migration to Git I switched to the regular git client.

Create kernel config

I used to create a kernel config to get RACCT and RCTL but these days both are included in GENERIC, so no need for that anymore. Yay.

Building world and kernel

Finally I start the build. I use -j to run one make job per core in the system. sysctl hw.ncpu shows the number of available cores:

# sysctl hw.ncpu
hw.ncpu: 12
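When scripting this, the job count can be derived automatically. A small sketch (sysctl hw.ncpu is the FreeBSD way; the getconf fallback is an assumption added for portability):

```shell
# Determine build parallelism: one make job per core
JOBS=$(sysctl -n hw.ncpu 2>/dev/null)
[ -n "$JOBS" ] || JOBS=$(getconf _NPROCESSORS_ONLN)   # portable fallback (assumption)
echo "building with -j${JOBS}"
```

The result can then be passed along as make -j"$JOBS" buildworld, exactly as in the command below.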

To build the new system:

$ sudo -i bash
# cd /usr/src/
# time (make -j$(sysctl -n hw.ncpu) buildworld && make -j$(sysctl -n hw.ncpu) kernel) && date

After the build finishes, reboot into the newly built kernel and run mergemaster, installworld, and mergemaster again; and finally delete-old and delete-old-libs:

$ sudo -i bash
# cd /usr/src/
# mergemaster -pFUi && make installworld && mergemaster -FUi && make -DBATCH_DELETE_OLD_FILES delete-old && make -DBATCH_DELETE_OLD_FILES delete-old-libs

PAY ATTENTION DURING MERGEMASTER! DO NOT OVERWRITE /etc/group AND /etc/master.passwd AND OTHER CRITICAL FILES!

Reboot after the final mergemaster completes, and boot into the newly built world.

Preparing ezjail

ezjail needs to be installed and configured, and /usr/jails/basejail and /usr/jails/newjail need to be bootstrapped.

Installing ezjail

Just install it with pkg:

sudo pkg install ezjail

Configuring ezjail

Then I go edit the ezjail config file /usr/local/etc/ezjail.conf and add/change these three lines near the bottom:

ezjail_use_zfs="YES"
ezjail_jailzfs="gelipool/jails"
ezjail_use_zfs_for_jails="YES"

This makes ezjail use separate ZFS datasets under gelipool/jails for the basejail and newjail, as well as for each jail created. ezjail_use_zfs_for_jails is supported since ezjail 3.2.2.

Bootstrapping ezjail

Finally I populate basejail and newjail from the world I built earlier:

$ sudo ezjail-admin update -i

The last line of the output is a message saying:

Note: a non-standard /etc/make.conf was copied to the template jail in order to get the ports collection running inside jails.

This is because ezjail defaults to symlinking the ports collection in the same way it symlinks the basejail. I prefer having separate, individual ports collections in each of my jails, though, so I remove the symlink and make.conf from newjail:

$ sudo rm /usr/jails/newjail/etc/make.conf /usr/jails/newjail/usr/ports /usr/jails/newjail/usr/src
$ sudo mkdir /usr/jails/newjail/usr/src

ZFS goodness

Note that ezjail has created two new ZFS datasets to hold basejail and newjail:

$ zfs list -r gelipool/jails
NAME                      USED  AVAIL  REFER  MOUNTPOINT
gelipool/jails            239M  3.54T   476K  /usr/jails
gelipool/jails/basejail   236M  3.54T   236M  /usr/jails/basejail
gelipool/jails/newjail   3.10M  3.54T  3.10M  /usr/jails/newjail

ezjail flavours

ezjail has a pretty awesome feature that makes it possible to create templates or flavours which apply common settings when creating a new jail. I always have a basic flavour which adds a user for me, installs an SSH key, adds a few packages like bash, screen, sudo and portmaster - and configures those packages. Basically, everything I find myself doing over and over again every time I create a new jail.

It is also possible, of course, to create more advanced flavours; I've had one that installs a complete nginx+php-fpm server with all the necessary packages and configs.

ezjail flavours are technically pretty simple. By default they are located in the same place as basejail and newjail, and ezjail comes with an example flavour to get you started. Basically, a flavour is a file/directory hierarchy which is copied to the jail, plus a shell script called ezjail.flavour which is run once, the first time the jail is started, and then deleted.

For reference, I've included my basic flavour here. First is a listing of the files included in the flavour, and then the ezjail.flavour script which performs tasks beyond copying config files.

$ find /usr/jails/flavours/tykbasic
/usr/jails/flavours/tykbasic
/usr/jails/flavours/tykbasic/ezjail.flavour
/usr/jails/flavours/tykbasic/usr
/usr/jails/flavours/tykbasic/usr/local
/usr/jails/flavours/tykbasic/usr/local/etc
/usr/jails/flavours/tykbasic/usr/local/etc/portmaster.rc
/usr/jails/flavours/tykbasic/usr/local/etc/sudoers
/usr/jails/flavours/tykbasic/usr/home
/usr/jails/flavours/tykbasic/usr/home/tykling
/usr/jails/flavours/tykbasic/usr/home/tykling/.ssh
/usr/jails/flavours/tykbasic/usr/home/tykling/.ssh/authorized_keys
/usr/jails/flavours/tykbasic/usr/home/tykling/.screenrc
/usr/jails/flavours/tykbasic/etc
/usr/jails/flavours/tykbasic/etc/fstab
/usr/jails/flavours/tykbasic/etc/rc.conf
/usr/jails/flavours/tykbasic/etc/periodic.conf
/usr/jails/flavours/tykbasic/etc/resolv.conf

As you can see, the flavour contains files like /etc/resolv.conf and other stuff to make the jail work. The name of the flavour here is tykbasic which means that if I want a file to end up in /usr/home/tykling after the flavour has been applied, I need to put that file in the folder /usr/jails/flavours/tykbasic/usr/home/tykling/ - remember to also chown the files in the flavour appropriately.

Finally, my ezjail.flavour script looks like so:

#!/bin/sh
#
# BEFORE: DAEMON
#
# ezjail flavour example

# Timezone
###########
#
ln -s /usr/share/zoneinfo/Europe/Copenhagen /etc/localtime

# Groups
#########
#
pw groupadd -q -n tykling

# Users
########
#
# To generate a password hash for use here, do:
# openssl passwd -1 "the password"
echo -n '$1$L/fC0UrO$bi65/BOIAtMkvluDEDCy31' | pw useradd -n tykling -u 1001 -s /bin/sh -m -d /usr/home/tykling -g tykling -c 'tykling' -H 0

# Packages
###########
#
env ASSUME_ALWAYS_YES=YES pkg bootstrap
pkg install -y bash
pkg install -y sudo
pkg install -y portmaster
pkg install -y screen


#change shell to bash (full path; the bash package registers it in /etc/shells)
chsh -s /usr/local/bin/bash tykling

#update /etc/aliases
echo "root:   thomas@gibfest.dk" >> /etc/aliases
newaliases

#remove adjkerntz from crontab
cat /etc/crontab | grep -E -v "(Adjust the time|adjkerntz)" > /etc/crontab.new
mv /etc/crontab.new /etc/crontab

#remove ports symlink
rm /usr/ports

# create symlink to /usr/home in / (adduser defaults to /usr/username as homedir)
ln -s /usr/home /home

Creating a flavour is easy: just create a folder under /usr/jails/flavours/ that has the name of the flavour, and start adding files and folders there. The ezjail.flavour script should be placed in the root (see the example further up the page).
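The skeleton can be scaffolded in one go. A sketch using a temporary path purely for illustration (the real location is /usr/jails/flavours/, and the layout mirrors the tykbasic listing above):

```shell
# Scaffold a flavour skeleton like the tykbasic listing above (temp path for illustration)
root=$(mktemp -d)/tykbasic
mkdir -p "$root/etc" \
         "$root/usr/local/etc" \
         "$root/usr/home/tykling/.ssh"
touch "$root/ezjail.flavour"
chmod 755 "$root/ezjail.flavour"   # make the setup script executable
find "$root" | sort                # show the resulting hierarchy
```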

Finally I add the following to /usr/local/etc/ezjail.conf to make ezjail always use my new flavour:

ezjail_default_flavour="tykbasic"

Configuration

This section outlines what I do to further prepare the machine to be a nice ezjail host.

Firewall

One of the first things I fix is to enable the pf firewall from OpenBSD. I add the following to /etc/rc.conf to enable pf at boot time:

sudo sysrc pf_enable="YES"
sudo sysrc pflog_enable="YES"

I also create a very basic /etc/pf.conf:

[root@ ~]# cat /etc/pf.conf 
### macros
if="em0"
table <portknock> persist

#external addresses
tykv4="a.b.c.d"
tykv6="2002:ab:cd::/48"
table <allowssh> { $tykv4,$tykv6 }

#local addresses
glasv4="w.x.y.z"

### scrub
scrub in on $if all fragment reassemble


################
### filtering
### block everything
block log all

################
### skip loopback interface(s)
set skip on lo0


################
### icmp6                                                                                                        
pass in quick on $if inet6 proto icmp6 all icmp6-type {echoreq,echorep,neighbradv,neighbrsol,routeradv,routersol}


################
### pass outgoing
pass out quick on $if all


################
### portknock rule (more than 5 connections in 10 seconds to the port specified will add the "offending" IP to the <portknock> table)
pass in quick on $if inet proto tcp from any to $glasv4 port 32323 synproxy state (max-src-conn-rate 5/10, overload <portknock>)

### pass incoming ssh and icmp
pass in quick on $if proto tcp from { <allowssh>, <portknock> } to ($if) port 22
pass in quick on $if inet proto icmp all icmp-type { 8, 11 }

################
### pass ipv6 fragments (hack to workaround pf not handling ipv6 fragments)
pass in on $if inet6
block in log on $if inet6 proto udp
block in log on $if inet6 proto tcp
block in log on $if inet6 proto icmp6
block in log on $if inet6 proto esp
block in log on $if inet6 proto ipv6

To load pf without rebooting I run the following:

[root@ ~]# kldload pf
[root@ ~]# kldload pflog
[root@ ~]# pfctl -ef /etc/pf.conf && sleep 60 && pfctl -d
No ALTQ support in kernel
ALTQ related functions disabled

I get no prompt after this because pf has cut my SSH connection. But I can SSH back in if I did everything right, and if not, I can just wait 60 seconds after which pf will be disabled again. I SSH in and reattach to the screen I am running this in, and press control-c, so the "sleep 60" is interrupted and pf is not disabled. Neat little trick for when you want to avoid locking yourself out :)
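The trick generalizes to any remote change that can lock you out: chain the risky command with a sleep and an automatic rollback, then interrupt the sleep once you have confirmed you still have access. A sketch with stand-in functions (echo placeholders instead of the real pfctl calls):

```shell
# Generic "deadman switch": apply a risky change, auto-revert unless interrupted
apply_rules()  { echo "rules loaded"; }    # stand-in for: pfctl -ef /etc/pf.conf
revert_rules() { echo "rules reverted"; }  # stand-in for: pfctl -d
apply_rules && sleep 1 && revert_rules     # press ctrl-c during the sleep to keep the change
```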

Process Accounting

I like to enable process accounting on my jail hosts. It can be useful in a lot of situations. I put the accounting data on a separate ZFS dataset.

I run the following commands to enable it:

sudo zfs create tank/root/var/account
sudo touch /var/account/acct
sudo chmod 600 /var/account/acct
sudo accton /var/account/acct
sudo sysrc accounting_enable="YES"

Replacing sendmail with Postfix

I always replace Sendmail with Postfix on every server I manage. See Replacing_Sendmail_With_Postfix for more info.

Listening daemons

When you add an IP alias for a jail, any daemons listening on * will also listen on the jail's IP, which is not what I want. For example, I want the jail's sshd to be able to listen on the jail's IP on port 22, instead of the host's sshd. Check for listening daemons like so:

$ sockstat -l46
USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS      
root     master     1554  12 tcp4   *:25                  *:*
root     master     1554  13 tcp6   *:25                  *:*
root     sshd       948   3  tcp6   *:22                  *:*
root     sshd       948   4  tcp4   *:22                  *:*
root     syslogd    789   6  udp6   *:514                 *:*
root     syslogd    789   7  udp4   *:514                 *:*
[tykling@glas ~]$ 

This tells me that I need to change Postfix, sshd and syslogd to stop listening on all IP addresses.

Postfix

The defaults in Postfix are really nice on FreeBSD, and most of the time a completely empty config file is fine for a system mailer (sendmail replacement). However, to make Postfix stop listening on port 25 on all IP addresses, I do need one line in /usr/local/etc/postfix/main.cf:

$ cat /usr/local/etc/postfix/main.cf
inet_interfaces=localhost
$ 


sshd

To make sshd stop listening on all IP addresses I uncomment and edit the ListenAddress line in /etc/ssh/sshd_config:

$ grep ListenAddress /etc/ssh/sshd_config 
ListenAddress x.y.z.226
#ListenAddress ::

(IP address obfuscated..)

syslogd

I don't need my syslogd to listen on the network at all, so I add the following line to /etc/rc.conf:

$ grep syslog /etc/rc.conf 
syslogd_flags="-ss"

Restarting services

Finally I restart Postfix, sshd and syslogd:

$ sudo /etc/rc.d/syslogd restart
Stopping syslogd.
Waiting for PIDS: 789.
Starting syslogd.
$ sudo /etc/rc.d/sshd restart
Stopping sshd.
Waiting for PIDS: 948.
Starting sshd.
$ sudo /usr/local/etc/rc.d/postfix restart
postfix/postfix-script: stopping the Postfix mail system
postfix/postfix-script: starting the Postfix mail system

A check with sockstat reveals that no more services are listening on all IP addresses:

$ sockstat -l46
USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS      
root     master     1823  12 tcp4   127.0.0.1:25          *:*
root     master     1823  13 tcp6   ::1:25                *:*
root     sshd       1617  3  tcp4   x.y.z.226:22         *:*
$ 

Network configuration

Network configuration is a big part of any jail setup. If I have enough IP addresses (ipv4 and ipv6) I can just add IP aliases as needed. If I only have one or a few v4 IPs I will need to use rfc1918 addresses for the jails. In that case, I create a new loopback interface, lo1 and add the IP aliases there. I then use the pf firewall to redirect incoming traffic to the right jail, depending on the port in use.

IPv4

If rfc1918 jails are needed, I add the following to /etc/rc.conf to create the lo1 interface on boot:

### lo1 interface for ipv4 rfc1918 jails
cloned_interfaces="lo1"

When the lo1 interface is created, or if it isn't needed, I am ready to start adding IP aliases for jails as needed.

IPv6

On the page Hetzner_ipv6 I've explained how to make IPv6 work on a Hetzner server where the supplied IPv6 default gateway is outside the IPv6 subnet assigned.

When basic IPv6 connectivity works, I am ready to start adding IP aliases for jails as needed.

Allow ping from inside jails

I add the following to /etc/sysctl.conf so the jails are allowed to do icmp ping. This enables raw socket access, which can be a security issue if you have untrusted root users in your jails. Use with caution.

#allow ping in jails
security.jail.allow_raw_sockets=1

Tips & tricks

Get jail info out of top

To make top show the jail id of the jail in which each process is running in a column, I need to specify the -j flag to top. Since this is a multi-cpu server I am working on, I also like giving the -P flag to top, to get a separate line of cpu stats per core. Finally, I like -a to get the full commandline/info of the running processes. I add the following to my .bashrc in my homedir on the jail host:

alias top="nice top -j -P -a"

...this way I don't have to remember passing -j -P -a to top every time. Also, I've been told to run top with nice to limit the cpu used by top itself. I took the advice, so the complete alias looks like the above.

Base jails

My jails need various services, so:

  • I have a DNS jail with a caching DNS server which is used by all the jails.
  • I also have a syslog server on each jailhost to collect syslog from all the jails and send them to my central syslog server.
  • I have a postgres jail on each jail host, so I only need to maintain one postgres server.
  • I have a reverse webproxy jail since many of my jails serve some sort of web application, and I like to terminate SSL for those in one place in an attempt to keep the individual jails as simple as possible.
  • Finally I like to run a tor relay on each box to spend any excess resources (cpu, memory, bandwidth) on something nice. As long as I don't run an exit node I can do this completely without any risk of complaints from the provider.

This section describes how I configure each of these "base jails".

DNS jail

I don't need a public v4 IP for this jail, so I configure it with an RFC1918 v4 IP on a loopback interface, and of course a real IPv6 address.

I am used to using bind but unbound is just as good. Use whatever you are comfortable with. Make sure you permit DNS traffic from the other jails to the DNS jail.

This jail needs to be started first since the rest of the jails need it for DNS. ezjail runs rcorder on the config files in /usr/local/etc/ezjail which means I can use the normal PROVIDE: and REQUIRE: to control the jail dependencies. I change the ezjail config for my DNS jail to have the following PROVIDE: line:

# PROVIDE: dns

The rest of the jails all get the following REQUIRE: line:

# REQUIRE: dns

Syslog jail

I don't need a public v4 IP for this jail, so I configure it with an RFC1918 v4 IP on a loopback interface, and of course a real IPv6 address.

This jail gets the following PROVIDE: line in its ezjail config file:

# PROVIDE: syslog

I use syslog-ng for this. I install the package using pkg install syslog-ng and then add a few things to the default config:

  • I use the following options, YMMV:

options { chain_hostnames(off); flush_lines(0); threaded(yes); use_fqdn(yes); keep_hostname(no); use_dns(yes); stats-freq(60); stats-level(0); };

  • I remove udp() from the default source and define a new source:

source jailsrc { udp(ip("10.0.0.1")); udp6(ip("w:x:y:z:10::1")); };

  • I also remove the line: destination console { file("/dev/console"); }; since the jail does not have access to /dev/console. I also remove any corresponding log statements that use destination(console);.
  • I add a new destination:
destination loghost{
            syslog(
                "syslog.tyktech.dk"
                transport("tls")
                port(1999)
                tls(peer-verify(required-untrusted))
                localip("10.0.0.1")
                log-fifo-size(10000)
            );
};
  • Finally tell syslog-ng to send all logdata from jailsrc to loghost:

log { source(jailsrc); destination(loghost); flags(flow_control); };

The rest of the jails get the following REQUIRE: line:

# REQUIRE: dns syslog

... and my jail flavour's default /etc/syslog.conf gets this line, so all the jails send their syslog messages to the syslog jail:

*.*                                             @10.0.0.1
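A quick sanity check from inside any jail: send a test message through the local syslogd (which forwards it per the line above) and watch for it on the syslog jail. The tag is arbitrary:

```
$ logger -t syslogtest "hello from this jail"
```

The message should then show up in whatever destinations the syslog jail routes the jailsrc source to.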

Postgres jail

I don't need a public v4 IP for this jail, so I configure it with an RFC1918 v4 IP on a loopback interface, and of course a real IPv6 address. I add an AAAA record in DNS for the v6 IP so I have something to point the clients at.

I install the latest Postgres server port; at the time of writing that is databases/postgresql93-server. But before I can run /usr/local/etc/rc.d/postgresql initdb I need to permit the use of SysV shared memory in the jail. This is done in the ezjail config file for the jail, in the _parameters line. I need to add allow.sysvipc=1, so I change the line from:

export jail_postgres_kush_tyknet_dk_parameters=""

to:

export jail_postgres_kush_tyknet_dk_parameters="allow.sysvipc=1"

While I'm there I also change the jail's # PROVIDE: line to:

# PROVIDE: postgres

And of course the standard # REQUIRE: line:

# REQUIRE: dns syslog

After restarting the jail I can run initdb and start Postgres. When a jail needs a database I need to:

  • Add a DB user (with the createuser -P someusername command)
  • Add a database with the new user as owner (createdb -O someusername somedbname)
  • Add permissions in /usr/local/pgsql/data/pg_hba.conf
  • Open a hole in the firewall so the jail can reach the database on TCP port 5432
  • Add postgres to the # REQUIRE: line in the ezjail config file
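For example, giving a hypothetical web jail its own database (the user, database and client address below are all made up for illustration) would look like this, with a matching line in pg_hba.conf:

```
$ createuser -P wwwuser
$ createdb -O wwwuser wwwdb
$ grep wwwdb /usr/local/pgsql/data/pg_hba.conf
host    wwwdb    wwwuser    w:x:y:z:10::20/128    md5
```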

Web Jail

I need a public v4 IP for the web jail and I also give it a v6 IP. Since I use a different v6 IP per website, I will need additional v6 addresses when I start adding websites. I add the v6 addresses to the web jail in batches of 10 as I need them. After creating the jail and bootstrapping the ports collection I install security/openssl and www/nginx and configure it. More on that later.

Tor relay jail

The Tor relay needs a public IP. It also needs the security/openssl and security/tor ports built (from ports, not packages, to ensure Tor is built with a recent OpenSSL to speed up ECDH). I put the following into the /usr/local/etc/tor/torrc file:

Log notice file /var/log/tor/notices.log
ORPort 443 NoListen
ORPort 9090 NoAdvertise
Address torrelay.bong.tyknet.dk
Nickname TykRelay01
ContactInfo Thomas Steen Rasmussen / Tykling <thomas@gibfest.dk> (PGP: 0x772FF77F0972FA58)
DirPort 80 NoListen
DirPort 9091 NoAdvertise
ExitPolicy reject *:*

I change the Address and Nickname depending on the server.

A few steps (that should really be done by the port) are needed here:

sudo rm -rf /var/db/tor /var/run/tor
sudo mkdir -p /var/db/tor/data /var/run/tor /var/log/tor
sudo chown -R _tor:_tor /var/db/tor /var/log/tor /var/run/tor
sudo chmod -R 700 /var/db/tor

I also add the following line to the jail host's /etc/sysctl.conf to make it impossible to predict IP IDs from the server:

net.inet.ip.random_id=1

Finally I redirect TCP ports 9090 and 9091 to ports 443 and 80 in the jail in /etc/pf.conf:

$ grep tor /etc/pf.conf
torv4="85.235.250.88"
torv6="2a01:3a0:1:1900:85:235:250:88"
rdr on $if inet proto tcp from any to $torv4 port 443 -> $torv4 port 9090
rdr on $if inet6 proto tcp from any to $torv6 port 443 -> $torv6 port 9090
rdr on $if inet proto tcp from any to $torv4 port 80 -> $torv4 port 9091
rdr on $if inet6 proto tcp from any to $torv6 port 80 -> $torv6 port 9091
pass in quick on { $if, $jailif } proto tcp from any to { $torv4 $torv6 } port { 9090, 9091 }
$

ZFS snapshots and backup

Since all of this is ZFS based, there are a few tricks I use to make it easier to restore data in case of accidental file deletion or other data loss.

Periodic snapshots using sysutils/zfs-periodic

sysutils/zfs-periodic is a little script that uses the FreeBSD periodic(8) system to make snapshots of filesystems with regular intervals. It supports making hourly snapshots with a small change to periodic(8), but I've settled for daily, weekly and monthly snapshots on my servers.

After installing sysutils/zfs-periodic I add the following to /etc/periodic.conf:

#daily zfs snapshots
daily_zfs_snapshot_enable="YES"
daily_zfs_snapshot_pools="tank gelipool"
daily_zfs_snapshot_keep=7
daily_zfs_snapshot_skip="gelipool/backups"

#weekly zfs snapshots
weekly_zfs_snapshot_enable="YES"
weekly_zfs_snapshot_pools="tank gelipool"
weekly_zfs_snapshot_keep=5
weekly_zfs_snapshot_skip="gelipool/backups"

#monthly zfs snapshots
monthly_zfs_snapshot_enable="YES"
monthly_zfs_snapshot_pools="tank gelipool"
monthly_zfs_snapshot_keep=6
monthly_zfs_snapshot_skip="gelipool/backups"

#monthly zfs scrub
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold=30

Note that the last bit also enables a scrub of the pools: the daily periodic job checks each pool and scrubs it when the last scrub is more than 30 days old, so in practice it runs monthly. Remember to change the pool names, and set the number of snapshots to retain to something appropriate. These things are always a tradeoff between diskspace and safety. Think it over and find some values that let you sleep well at night :)

After this has been running for a few days, you should have a bunch of daily snapshots:

$ zfs list -t snapshot | grep gelipool@ 
gelipool@daily-2012-09-02                                 0      -    31K  -
gelipool@daily-2012-09-03                                 0      -    31K  -
gelipool@daily-2012-09-04                                 0      -    31K  -
gelipool@daily-2012-09-05                                 0      -    31K  -
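With a few days of snapshots available, restoring an accidentally deleted file is just a copy out of the read-only .zfs/snapshot directory under the dataset's mountpoint (run zfs set snapdir=visible on the dataset if the directory is hidden). The jail and file paths below are examples, assuming each jail is its own dataset mounted under /usr/jails:

```
$ cp /usr/jails/somejail/.zfs/snapshot/daily-2012-09-04/etc/rc.conf /usr/jails/somejail/etc/rc.conf
```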

Back-to-back ZFS mirroring

I am lucky enough to have more than one of these jail hosts, which is the whole reason I started writing down how I configure them. One of the advantages of having more than one is that I can configure zfs send/receive jobs and make server A send its data to server B, and vice versa.


Introduction

The concept is pretty basic, but as it often happens, security considerations turn what was a simple and elegant idea into something... else. To make the back-to-back backup scheme work without sacrificing too much security, I first make a jail on each jailhost called backup.jailhostname. This jail will have control over a designated zfs dataset which will house the backups sent from the other server.


Create ZFS dataset

First I create the zfs dataset:

$ sudo zfs create cryptopool/backups
$ sudo zfs set jailed=on cryptopool/backups

'jail' the new dataset

I create the jail like I normally do, but after creating it, I edit the ezjail config file and tell it which extra zfs dataset to use:

$ grep dataset /usr/local/etc/ezjail/backup_glas_tyknet_dk 
export jail_backup_glas_tyknet_dk_zfs_datasets="cryptopool/backups"

This makes ezjail run the zfs jail command with the proper jail id when the jail is started.

jail sysctl settings

I also add the following to the jail's ezjail config:

# grep parameters /usr/local/etc/ezjail/backup_glas_tyknet_dk
export jail_backup_glas_tyknet_dk_parameters="allow.mount.zfs=1 enforce_statfs=1"

Configuring the backup jail

The jail is ready to run now, and inside the jail a zfs list looks like this:

$ zfs list
NAME                                USED  AVAIL  REFER  MOUNTPOINT
cryptopool                         3.98G  2.52T    31K  none
cryptopool/backups                   62K  2.52T    31K  none
$

I don't want to open up root ssh access to this jail, but the remote servers need to call zfs receive which requires root permissions. zfs allow to the rescue! zfs allow makes it possible to say "user X is permitted to do action Y on dataset Z" which is what I need here. In the backup jail I add a user called tykbackup which will be used as the user receiving the zfs snapshots from the remote servers.

I then run the following commands to allow the user to work with the dataset:

$ sudo zfs allow tykbackup atime,compression,create,mount,mountpoint,readonly,receive cryptopool/backups
$ sudo zfs allow cryptopool/backups
---- Permissions on cryptopool/backups -------------------------------
Local+Descendent permissions:
        user tykbackup atime,compression,create,mount,mountpoint,readonly,receive
$ 

Testing if it worked:

$ sudo su tykbackup
$ zfs create cryptopool/backups/test
$ zfs list cryptopool/backups/test
NAME                      USED  AVAIL  REFER  MOUNTPOINT
cryptopool/backups/test    31K  2.52T    31K  none
$ zfs destroy cryptopool/backups/test
cannot destroy 'cryptopool/backups/test': permission denied
$ 

Since the user tykbackup was not granted the destroy permission on the cryptopool/backups dataset, I get permission denied (as expected) when trying to destroy cryptopool/backups/test. Works like a charm.

To allow automatic SSH operations I add the public ssh key for the root user of the server being backed up to /usr/home/tykbackup/.ssh/authorized_keys:

$ cat /usr/home/tykbackup/.ssh/authorized_keys
from="ryst.tyknet.dk",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty,command="/usr/home/tykbackup/zfscmd.sh $SSH_ORIGINAL_COMMAND" ssh-rsa AAAAB3......KR2Z root@ryst.tyknet.dk

The script called zfscmd.sh is placed on the backup server to allow the ssh client to issue different command line arguments depending on what needs to be done. The script is very simple:

#!/bin/sh
### drop $1: sshd's forced command duplicates this script's own path
shift
/sbin/zfs "$@"
exit $?
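The shift is needed because sshd appends $SSH_ORIGINAL_COMMAND to the forced command, and the client invokes the script by its full path, so the first argument zfscmd.sh receives is a duplicate of its own path. A tiny illustration (the paths and arguments are just examples):

```shell
# simulate the argument list zfscmd.sh receives when a client runs:
#   ssh tykbackup@backuphost /usr/home/tykbackup/zfscmd.sh list tank
# (sshd prepends the forced command, duplicating the script path)
set -- /usr/home/tykbackup/zfscmd.sh list tank
shift                 # drop the duplicated script path
echo "$*"             # prints: list tank
```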

A few notes: aside from restricting the command this SSH key can run, I have also restricted it to logins from the IP of the server being backed up. These are very basic restrictions that should always be in place no matter what kind of backup you are using.

Add the periodic script

I then add the script /usr/local/etc/periodic/daily/999.zfs-mirror to each server being backed up with the following content:

#!/bin/sh
#set -x

### check pidfile
if [ -f /var/run/$(basename $0).pid ]; then
        echo "pidfile /var/run/$(basename $0).pid exists, bailing out"
        exit 1
fi
echo $$ > /var/run/$(basename $0).pid


### If there is a global system configuration file, suck it in.
if [ -r /etc/defaults/periodic.conf ]; then
        . /etc/defaults/periodic.conf
        source_periodic_confs
fi

case "$daily_zfs_mirror_enable" in
    [Yy][Ee][Ss])
        ;;
    *)
        ### not enabled, clean up the pidfile before exiting so
        ### future runs are not blocked
        rm /var/run/$(basename $0).pid
        exit
        ;;
esac

pools=$daily_zfs_mirror_pools
if [ -z "$pools" ]; then
        pools='tank'
fi

targethost=$daily_zfs_mirror_targethost
if [ -z "$targethost" ]; then
        echo '$daily_zfs_mirror_targethost must be set in /etc/periodic.conf'
        rm /var/run/$(basename $0).pid
        exit 1
fi

targetuser=$daily_zfs_mirror_targetuser
if [ -z "$targetuser" ]; then
        echo '$daily_zfs_mirror_targetuser must be set in /etc/periodic.conf'
        rm /var/run/$(basename $0).pid
        exit 1
fi

targetfs=$daily_zfs_mirror_targetfs
if [ -z "$targetfs" ]; then
        echo '$daily_zfs_mirror_targetfs must be set in /etc/periodic.conf'
        rm /var/run/$(basename $0).pid
        exit 1
fi

if [ -n "$daily_zfs_mirror_skip" ]; then
        egrep="($(echo $daily_zfs_mirror_skip | sed "s/ /|/g"))"
fi

### get todays date for later use
tday=$(date +%Y-%m-%d)

### check if the destination fs exists
ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh list ${targetfs} > /dev/null 2>&1
if [ $? -ne 0 ]; then
        echo "Creating destination fs on target server"
        echo ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh create ${targetfs}
        ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh create ${targetfs}
fi

echo -n "Doing daily ZFS mirroring - "
date

### loop through the configured pools
for pool in $pools; do
        echo "    Processing pool $pool ..."
        ### enumerate datasets with daily snapshots from today
        if [ -n "$egrep" ]; then
                datasets=$(zfs list -t snapshot -o name | grep "^$pool[\/\@]" | egrep -v "$egrep" | grep "@daily-$tday")
        else
                datasets=$(zfs list -t snapshot -o name | grep "^$pool[\/\@]" | grep "@daily-$tday")
        fi

        echo "found datasets: $datasets"

        for snapshot in $datasets; do
                dataset=$(echo -n $snapshot | cut -d "@" -f 1)
                echo "working on dataset $dataset"
                ### find the latest daily snapshot of this dataset on the remote node, if any
                echo ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh list -t snapshot \| grep "^${targetfs}/${dataset}@daily-" \| cut -d " " -f 1 \| tail -1
                lastgoodsnap=$(ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh list -t snapshot | grep "^${targetfs}/${dataset}@daily-" | cut -d " " -f 1 | tail -1)
                if [ -z "$lastgoodsnap" ]; then
                        echo "No remote daily snapshot found for local daily snapshot $snapshot - cannot send incremental - sending full backup"
                        zfs send -v $snapshot | mbuffer | ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh receive -v -F -u $targetfs/$dataset
                        if [ $? -ne 0 ]; then
                                echo "    Unable to send full snapshot of $dataset to $targetfs on host $targethost"
                        else
                                echo "    Successfully sent a full snapshot of $dataset to $targetfs on host $targethost - future sends will be incremental"
                        fi
                else
                        ### check if this snapshot has already been sent for some reason, skip if so
                        temp=$(echo $snapshot | cut -d "/" -f 2-)
                        lastgoodsnap="$(echo $lastgoodsnap | sed "s,${targetfs}/,,")"
                        if [ "$temp" = "$lastgoodsnap" ]; then
                                echo "    The snapshot $snapshot has already been sent to $targethost, skipping..."
                        else
                                ### zfs send the difference between latest remote snapshot and todays local snapshot
                                echo "    Sending the diff between local snapshot $(hostname)@$lastgoodsnap and $(hostname)@$pool/$snapshot to ${targethost}@${targetfs}/${pool} ..."
                                zfs send -I $lastgoodsnap $snapshot | mbuffer | ssh ${targetuser}@${targethost} /usr/home/tykbackup/zfscmd.sh receive -v -F -u $targetfs/$dataset
                                if [ $? -ne 0 ]; then
                                        echo "    There was a problem sending the diff between $lastgoodsnap and $snapshot to $targetfs on $targethost"
                                else
                                        echo "    Successfully sent the diff between $lastgoodsnap and $snapshot to $targethost"
                                fi
                        fi
                fi
        done
done

### remove pidfile
rm /var/run/$(basename $0).pid
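The script reads its configuration from /etc/periodic.conf on the sending server. All the knobs it checks are shown below; the hostname, user, target dataset and skip list are examples matching the setup described above:

```
daily_zfs_mirror_enable="YES"
daily_zfs_mirror_pools="tank gelipool"
daily_zfs_mirror_targethost="glas.tyknet.dk"
daily_zfs_mirror_targetuser="tykbackup"
daily_zfs_mirror_targetfs="cryptopool/backups"
daily_zfs_mirror_skip="gelipool/backups"
```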

Run the periodic script

I usually do the initial run of the periodic script by hand, so I can catch and fix any errors right away. The script loops over all datasets in the configured pools and zfs sends them, including their snapshots, to the backup server. The next time the script runs it sends an incremental diff instead of the full dataset.

Caveats

This script does not handle deleting datasets (including their snapshots) on the backup server when the dataset is deleted from the server being backed up. You will need to do that manually. This could be considered a feature, or a missing feature, depending on your preferences. :)

Staying up-to-date

I update my ezjail hosts and jails to track -STABLE regularly. This section describes the procedure I use. It is essential that the jail host and the jails use the same world version, or bad stuff will happen.

Updating the jail host

First I update world and kernel of the jail host like I normally would. This is described earlier in this guide, see Ezjail_host#Building_world_and_kernel.

Updating ezjails basejail

To update ezjails basejail located in /usr/jails/basejail, I run the same commands as when bootstrapping ezjail, see the section Ezjail_host#Bootstrapping_ezjail.

Running mergemaster in the jails

Finally, to run mergemaster in all jails I use the following script. It runs mergemaster in each jail; the script comments should explain the rest. When it is finished, the jails can be started:

#! /bin/sh

### check if .mergemasterrc exists,
### move it out of the way if so
MM_RC=0
if [ -e /root/.mergemasterrc ]; then
        MM_RC=1
        mv /root/.mergemasterrc /root/.mergemasterrc.old
fi

### loop through jails
for jailname in $(ls -1 /usr/jails/ | grep -Ev "(^basejail$|^newjail$|^flavours$)"); do
        jailroot="/usr/jails/${jailname}"
        echo "processing ${jailroot}:"
        ### check if jailroot exists
        if [ -n "${jailroot}" -a -d "${jailroot}" ]; then
                ### create .mergemasterrc
                cat <<EOF > /root/.mergemasterrc
AUTO_INSTALL=yes
AUTO_UPGRADE=yes
FREEBSD_ID=yes
PRESERVE_FILES=yes
PRESERVE_FILES_DIR=/var/tmp/mergemaster/preserved-files-$(basename ${jailroot})-$(date +%y%m%d-%H%M%S)
IGNORE_FILES="/boot/device.hints /etc/motd"
EOF
                ### remove backup of /etc from previous run (if it exists)
                if [ -d "${jailroot}/etc.bak" ]; then
                        rm -rfI "${jailroot}/etc.bak"
                fi

                ### create backup of /etc as /etc.bak
                cp -pRP "${jailroot}/etc" "${jailroot}/etc.bak"

                ### check if mtree from last mergemaster run exists
                if [ ! -e ${jailroot}/var/db/mergemaster.mtree ]; then
                        ### delete /etc/rc.d/*
                        rm -rfI ${jailroot}/etc/rc.d/*
                fi
                ### run mergemaster for this jail
                mergemaster -D "${jailroot}"
        else
                echo "${jailroot} doesn't exist"
        fi
        sleep 2
done

### if an existing .mergemasterrc was moved out of the way in the beginning, move it back now
if [ ${MM_RC} -eq 1 ]; then
        mv /root/.mergemasterrc.old /root/.mergemasterrc
else
        rm /root/.mergemasterrc
fi

### done, a bit of output
echo "Done. If everything went well the /etc.bak backup folders can be deleted now."
exit 0

To restart all jails I run the command ezjail-admin restart.

Replacing a defective disk

I had a broken harddisk on one of my servers this evening. This section describes how I replaced the disk to make everything work again.

Booting into the rescue system

After Hetzner staff physically replaced the disk, my server was unable to boot because the disk that died was the first one on the controller; the cheap Hetzner hardware cannot boot from the secondary disk, probably a BIOS restriction. If the other disk had died, the server would have booted fine and this whole process could have been done with the server running. Anyway, I booted into the rescue system, partitioned the disk, added a bootloader and added the new partition to the root zpool. After this I was able to boot the server normally, so the rest of the work was done without the rescue system.

Partitioning the new disk

The following shows the commands I ran to partition the disk:

[root@rescue ~]# gpart create -s GPT /dev/ad4
ad4 created
[root@rescue ~]# /sbin/gpart add -b 2048 -t freebsd-boot -s 128 /dev/ad4
ad4p1 added
[root@rescue ~]# gpart add -t freebsd-zfs -s 30G /dev/ad4
ad4p2 added
[root@rescue ~]# gpart add -t freebsd-ufs /dev/ad4
ad4p3 added
[root@rescue ~]# gpart show
=>        34  1465149101  ad6  GPT  (698G)
          34        2014       - free -  (1M)
        2048         128    1  freebsd-boot  (64k)
        2176    62914560    2  freebsd-zfs  (30G)
    62916736  1402232399    3  freebsd-ufs  (668G)

=>        34  1465149101  ad4  GPT  (698G)
          34        2014       - free -  (1M)
        2048         128    1  freebsd-boot  (64k)
        2176    62914560    2  freebsd-zfs  (30G)
    62916736  1402232399    3  freebsd-ufs  (668G)

[root@rescue ~]#
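The freebsd-boot partition on the new disk is still empty at this point, so I also write the boot code to it with gpart (ad4 being the new disk; this is the same command the zpool replace output suggests later, adjusted for the device name):

```
[root@rescue ~]# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
```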

Importing the pool and replacing the disk

The next step is importing the zpool (remember altroot=/mnt !) and replacing the defective disk:

[root@rescue ~]# zpool import
   pool: tank
     id: 3572845459378280852
  state: DEGRADED
 status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 config:

        tank                      DEGRADED
          mirror-0                DEGRADED
            11006001397618753837  UNAVAIL  cannot open
            ad6p2                 ONLINE
[root@rescue ~]# zpool import -o altroot=/mnt/ tank
[root@rescue ~]# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h2m with 0 errors on Thu Nov  1 05:00:49 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            11006001397618753837  UNAVAIL      0     0     0  was /dev/ada0p2
            ad6p2                 ONLINE       0     0     0

errors: No known data errors
[root@rescue ~]# zpool replace tank 11006001397618753837 ad4p2

Make sure to wait until resilver is done before rebooting.

If you boot from pool 'tank', you may need to update
boot code on newly attached disk 'ad4p2'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

[root@rescue ~]#
[root@rescue ~]# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Nov 27 00:24:41 2012
        823M scanned out of 3.11G at 45.7M/s, 0h0m to go
        823M resilvered, 25.88% done
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              11006001397618753837  UNAVAIL      0     0     0  was /dev/ada0p2
              ad4p2                 ONLINE       0     0     0  (resilvering)
            ad6p2                   ONLINE       0     0     0

errors: No known data errors
[root@rescue ~]# zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 3.10G in 0h2m with 0 errors on Tue Nov 27 01:26:45 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors
[root@rescue ~]#

Reboot into non-rescue system

At this point I rebooted the machine into the normal FreeBSD system.

Re-create geli partition

To recreate the geli partition on p3 of the new disk, I just follow the same steps as when I originally created it, more info here.

To attach the new geli volume I run geli attach as described here.

Add the geli device to the encrypted zpool

First I check that both geli devices are available, and I check the device name that needs replacing in zpool status output:

[tykling@haze ~]$ geli status
      Name  Status  Components
ada1p3.eli  ACTIVE  ada1p3
ada0p3.eli  ACTIVE  ada0p3
[tykling@haze ~]$ zpool status gelipool
  pool: gelipool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 68K in 0h28m with 0 errors on Thu Nov  1 05:58:08 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        gelipool                  DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            18431995264718840299  REMOVED      0     0     0  was /dev/ada0p3.eli
            ada1p3.eli            ONLINE       0     0     0

errors: No known data errors
[tykling@haze ~]$

To replace the device and begin resilvering:

[tykling@haze ~]$ sudo zpool replace gelipool 18431995264718840299 ada0p3.eli
Password:
[tykling@haze ~]$ zpool status
  pool: gelipool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Nov 27 00:53:40 2012
        759M scanned out of 26.9G at 14.6M/s, 0h30m to go
        759M resilvered, 2.75% done
config:

        NAME                        STATE     READ WRITE CKSUM
        gelipool                    DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             REMOVED      0     0     0
              18431995264718840299  REMOVED      0     0     0  was /dev/ada0p3.eli/old
              ada0p3.eli            ONLINE       0     0     0  (resilvering)
            ada1p3.eli              ONLINE       0     0     0

errors: No known data errors
[tykling@haze ~]$

When the resilver is finished, the system is good as new.