Fixing a bad block in an ext4 partition on an advanced format SATA drive

I recently got an email from my home server, warning me that it had detected an error on one of its hard drives. This was automatically generated by smartd, part of SmartMonTools, that monitors the health of the disk storage attached to my server by running a series of regular tests without my intervention.

To find out what had exactly been found, I used the smartctl command to see the logged results of the last few self-tests. As you can see, the daily Short Offline tests were all passing successfully, but the long-running weekly Extended Offline tests were showing a problem with the same LBA on each run, namely LBA 1318075984:


# smartctl -l xselftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, http://www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9559 -
# 2 Short offline Completed without error 00% 9535 -
# 3 Short offline Completed without error 00% 9511 -
# 4 Extended offline Completed: read failure 70% 9490 1318075984
# 5 Short offline Completed without error 00% 9487 -
# 6 Short offline Completed without error 00% 9463 -
# 7 Short offline Completed without error 00% 9440 -
# 8 Short offline Completed without error 00% 9416 -
# 9 Short offline Completed without error 00% 9392 -
#10 Short offline Completed without error 00% 9368 -
#11 Short offline Completed without error 00% 9344 -
#12 Extended offline Completed: read failure 70% 9322 1318075984
#13 Short offline Completed without error 00% 9320 -
#14 Short offline Completed without error 00% 9296 -
#15 Extended offline Completed without error 00% 9204 -
#16 Short offline Completed without error 00% 9198 -
#17 Short offline Completed without error 00% 9176 -
#18 Short offline Completed without error 00% 9152 -

The fact that this is a “read failure” probably means that this is a medium error. That can usually be resolved by writing fresh data to the block. This will either succeed (in the case of a transient problem), or cause the drive to reallocate a spare block to replace the now-failed block. The problem, of course, is that that block might be part of some important piece of data. Fortunately I have backups. But I’d prefer to restore only the damaged file, rather than the whole disk. The rest of this post discusses how to achieve that.

Firstly we need to look at the disk layout to determine what partition the affected block falls within:


gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 453C41A1-848D-45CA-AC5C-FC3FE68E8280
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2157 sectors (1.1 MiB)

Number Start (sector) End (sector) Size Code Name
1 2048 1050623 512.0 MiB EF00
2 1050624 4956159 1.9 GiB 8300
3 4956160 44017663 18.6 GiB 8300
4 44017664 52219903 3.9 GiB 8200
5 52219904 61984767 4.7 GiB 8300
6 61984768 940890111 419.1 GiB 8300
7 940890112 3907028991 1.4 TiB 8300

We can see that the disk uses logical 512 byte sectors, and that the failing sector is in partion /dev/sda7. We also want to know (for later) what is the physical block size for this disk, which can be found by:


# cat /sys/block/sda/queue/physical_block_size
4096

Since this is larger than the LBA size (of 512 bytes) it means that it’s actually the physical block that contains LBA 1318075984 that is failing, and therefore so will be all the other LBAs in that physical block. In this case, that means 8 LBAs. Because of the way the SMART selftests work, it’s likely that 1318075984 and the following 7 will be failing, but we can test that later.

Next we need to understand what filesystem that partition has been formatted as. I happen to know that all my partitions are formatted as ext4 on this system, but you could find this out this information from the /etc/fstab configuration file.

The rest of this post is only directly relevant to ext4/3/2 filesystems. Feel free to use the general process, but please look elsewhere for detailed instructions for BTRFS, XFS, etc etc.

Next thing to do is to determine the offset of the failing LBA into the sda7 partition. So, 1318075984 – 940890112, which is 377185872 blocks of 512 bytes. We now need to know how many filesystem blocks that is, so lets find out what blocksize that partition is using:


# tune2fs -l /dev/sda7 | grep Block
Block count: 370767360
Block size: 4096
Blocks per group: 32768

So, each filesystem block is 4096 bytes. To determine the offset of the failing LBA in the filesystem, we divide the LBA offset into the filesystem by 8 (4096/512), giving us a filesystem offset of 47148234. Since this is an exact result, we know it happens to be the first logical LBA in that filesystem block that is causing the error (as we expected).

Next we want to know if that LBA is in use, or part of the filesystems free space:


# debugfs /dev/sda7
debugfs 1.42.9 (4-Feb-2014)
debugfs: testb 47148234
Block 47148234 marked in use

So we know that filesystem block is part of a file – unfortunately. The question is which one?


debugfs: icheck 47148234
Block Inode number
47148234 123993
debugfs: ncheck 123993
Inode Pathname
123993 /media/homevideo/AA-20140806.mp4

Since the filesystem block size and the physical disk block size are the same, I could just assume that thats the only block affected. But that’s probably not very wise. So lets check the physical blocks (on the disk) before and after the one we know is failing by asking for the failing LBA + and – 8 LBA’s:


# # The reported failing LBA:
# dd if=/dev/sda of=sector.bytes skip=1318075984 bs=512 count=1
dd: error reading ‘/dev/sda’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 2.85748 s, 0.0 kB/s
# # The reported failing LBA - 8:
# dd if=/dev/sda of=sector.bytes skip=1318075976 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0246482 s, 20.8 kB/s
# # The reported failing LBA + 8:
# dd if=/dev/sda of=sector.bytes skip=1318075992 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0246482 s, 20.8 kB/s

So, in this case we can see that the physical blocks before and after the failing disk block are both currently readable, meaning that we only need to deal with the single failing block.

Since I have a backup of the file that this failing block occurs in, we’ll complete the resolution of the problem by overwriting the failing physical sector with zeros (definitely corrupting the containing file) and triggering the drives block-level reallocation routines, and then delete the file, prior to recovering it from backup:


# dd if=/dev/zero of=/dev/sda skip=1318075984 bs=512 count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 0.000756974 s, 5.4 MB/s
# rm /media/homevideo/AA-20140806.mp4
# dd if=/dev/sda of=sector.bytes skip=1318075984 bs=512 count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 0.000857198 s, 4.8 MB/s

At this point I could run an immediate Extended Offline self-test, but I’m confident that as I can now successfully read the originally-failing block, the problem is solved, and I’ll just wait for the normally scheduled self-tests to be run by smartd again.

Update: I’ve experienced a situation where overwriting the failing physical sector with zeros using dd failed to trigger the drives automatic block reallocation routines. However, in that case I was able to resolve the situation by using hdparm instead. Use:

hdparm --read-sector 1318075984 /dev/sda

to (try to) read and display a block of data, and

hdparm --write-sector 1318075984 --yes-i-know-what-i-am-doing /dev/sda

to overwrite the block with zeros. Both these commands use the drives logical block size (normally 512 bytes) not the 4K physical sector size.

Updated weatherstation software

I’ve been enhancing the software that I use to read data from my weatherstation. It’s been working well since I added some extra code to detect when the sensor readings are obviously corrupt (radio interference, uncontrolled concurrent memory accesses, etc), but the tracking of the wind direction was still not quite good enough.

To improve that, I’ve extended the number of readings used in the running average, and enhanced the algorithms that average the wind direction to take into account not only the direction of each data point (unity vectors), but also the speed of the wind in that direction (true vectors).

The result seems to track significantly more accurately to nearby high quality stations, but I am conscious that this is still presenting a manipulation of the data, rather than the actual data that a better sensor would provide. Having said that, it’s now producing a pretty good result for hardware that cost me only £50.

The software can be downloaded from here.

Playing with callerid again

Back in April last year I started playing with an old external serial-attached modem to read the callerid of incoming calls. My intention was to intercept calls from direct marketeers. The concept was good, but I ran into problems with the modem; it took up a lot of space, kept overheating, and lacked any voice facilities, limiting what I could do with it. In addition (probably because of the modem constantly overheating) the software I was running kept crashing.

So in the end, I gave up on the idea.

But recently we seem to have had a spate of annoying calls from direct marketeers based in India, selling products for UK companies that are cynically avoiding the UK’s regulations around direct marketing opt-outs. The straw that broke the camels back was the call that came through at 6am on a Saturday morning.

The problem here is that the phone companies don’t care about this. They make money from all these calls, so its not in their interest to block them. They’ll sell me a service to block “withheld” numbers, but not numbers that show as “unavailable”. Unfortunately, these days the majority of the problem calls come from India, and show as “unavailable” because the Indian call centers are using VoIP systems to place their calls to the UK, and they deliberately ensure that their callerid is “unavailable”.

So I’m back on the idea of making my own solution to this problem. So first off, I purchased a USR 5637 USB fax modem that is compatible with the UK callerid protocols. Even better, this is a voice modem too, so it can interact with a caller, sending audio as well as data down the phone line, and recognise touchtones. It’s also small, self-powered, cool-running and very reliable.

Next I spent some time looking to see what other people have done in this space, and eventually found this webpage, that describes a simple Bash script that intercepts calls that are not on a whitelist, and plays modem tones to them before hanging up. Recognised callers are allowed to ring the phone as normal, progressing to an answerphone if necessary. It’s not exactly the functionality that I want, but the simplicity is beguiling, and it’s trivial to extend it to do something closer to what I do want. And anyway, anything more sophisticated is going to require something like Asterisk, and switching my whole phone system over to VoIP, which is not going to be very family-friendly.

So for now, I’m gathering lists of all incoming calls to establish a basic whitelist, before moving on to do some really basic screening of calls.

Plusnet IPv6 still delayed, so let’s go spelunking in a Hurricane Electric tunnel

When I last changed ISP (last year, when Sky bought my old ISP, BE Unlimited) one of my requirements was for my new ISP to have a roadmap for making IPv6 available. I’ve been waiting (impatiently) for my new ISP, Plusnet, to deliver on their initial “coming soon” statements. Sadly, like almost all ISPs, Plusnet are not moving quickly with IPv6, and are still investigating alternatives like carrier grade NAT to extend the life of IPv4. I can sympathise with this position – IPv6 has limited device support, most of their customers are not ready to adopt it, and trying to provide support for the necessary dual-stack environment would not be easy. But, the problem of IPv4 address exhaustion is not going away.

So at the end of last year they started their second controlled trial of IPv6. I was keen to join, but the conditions were onerous. I would get a second account, I would need to provide my own IPv6-capable router, I couldn’t have my existing IPv4 static IP address, I couldn’t have Reverse DNS on the line, and I had to commit to providing feedback on my “normal” workload. So much as I wanted to join the trial, I couldn’t, as I wouldn’t be able to run my mailserver.

So I decided to investigate alternatives until such time as Plusnet get native IPv6 support working. The default solution in cases like mine, where my ISP only provides me with an IPv4 connection, is to tunnel IPv6 conversations through my IPv4 connection, to an ISP who does provide IPv6 connectivity to the Internet. There are two major players in this area for home users, SisXS and Hurricane Electric. Both provide all the basic functionality, as well as each having some individual specialist features. I’m just looking for a basic IPv6 connection and could use either, but in the end Hurricane Electric appeared vastly easier to register with, so I went with them.

My current internet connection is FTTC (fibre to the cabinet) via a BT OpenReach VDSL2 modem and my ISP-supplied (cheap and nasty) combined broadband router, NAT and firewall. This gives me a private 16bit IPv4 address space, for which my home server (a low-power mini-ITX system that runs 24×7) provides all the network management functions, such as DHCP and DNS.

What I want to add to this is a protocol-41 tunnel from the IPv6 ISP (Hurricane Electric, or HE) back through my NAT & Firewall to my home server. By registering for such a tunnel, HE provide (for free) a personal /64 subnet to me through that tunnel, allowing me to use my home server to provision IPv6 addresses to all the devices on my LAN. However, this connection is neither NAT’ed nor firewalled. The IPv6 addresses are both globally addressable and visible. So I also want my home server to act as a firewall for all IPv6 communications through that tunnel, to protect the devices on my network, without forcing them to all adopt their own firewalls. I was initially concerned that because my home server also acts as an OpenVPN endpoint, and so uses a bridged network configuration, getting the tunnel working might be quite awkward, but it turns out to make very little difference to the solution.

So, to make this work, first you need a static IPv4 address on the internet, and to have ensured that your router will respond to ICMP requests (pings!). Then you can register with Hurricane Electric, and “Create a Regular Tunnel”, which will result in a page of information describing your tunnel. I printed this for future reference (and in case I broke my network while making changes) but you can always access this information from the HE website.

You now need to edit /etc/network/interfaces. Add lines to define the tunnel, as follows, substituting the values from your tunnel description:

# Define 6in4 ipv6 tunnel to Hurricane Electric
auto he-ipv6
iface he-ipv6 inet6 v4tunnel
address [your "Client IPv6 Address"]
netmask 64
endpoint [your "Server IPv4 Address"]
ttl 255

up ip -6 route add default dev he-ipv6
down ip -6 route del default dev he-ipv6

Now add an address from your “Routed /64 IPv6 Prefix” to the appropriate interface – in my case, this is the bridge interface br0, but its more likely to be eth0 for you. This defines your servers static, globally accessible IPv6 address:

# Add an IPv6 address from the routed prefix to the br0 interface.
iface br0 inet6 static
address [any IPv6 address from the Routed /64 IPv6 Prefix]
netmask 64

Since I am running Ubuntu 12.04 I now need to install radvd, which will advertise the IPv6 subnet to any systems on our network that want to configure themselves an IPv6 connection. Think of it as a sort of DHCP for IPv6. However, when I move to 14.04 sometime later this year I expect to be able to get rid of radvd, and replace it with dnsamsq (which I already use for IPv4 DNS/DHCP), as the latest version of dnsmasq is reported to provide a superset of the radvd capabilities.

sudo apt-get update
sudo apt-get install radvd

Then configure radvd to give out IPv6 addresses from our Routed /64 IPv6 Prefix, by creating the file /etc/radvd.conf, and entering the following into it:

interface [your interface, probably eth0]
{
AdvSendAdvert on;
AdvLinkMTU 1480;
prefix [Your Routed /64 IPv6 Prefix, incl the /64]
{
AdvOnLink on;
AdvAutonomous on;
};
};

Any IPv6-capable devices will now ask for (and be allocated) an IPv6 address in your Routed /64 subnet, based on the MAC address of the interface that is requesting the IPv6 address.
Now uncomment the line:

# net.ipv6.conf.all.forwarding=1

from the file /etc/sysctl.conf. This will allow your server to act as a router for IPv6 traffic.

Now we need to enable and then configure the firewall. I take no credit for this, as much of the information related to the firewall was gleaned from this post. As I run Ubuntu Server I’ll use ufw, the Ubuntu Firewall utility to configure the underlying ipchains firewall. Alternative front-ends to ipchains will work equally well, though the actual method of configuration will obviously differ. First I needed to enable the firewall for IPv6 by editing /etc/default/ufw, and ensuring the following options are set correctly:

# Set to yes to apply rules to support IPv6 (no means only IPv6 on loopback
# accepted). You will need to 'disable' and then 'enable' the firewall for
# the changes to take affect.
IPV6=yes

and

# Set the default forward policy to ACCEPT, DROP or REJECT. Please note that
# if you change this you will most likely want to adjust your rules
DEFAULT_FORWARD_POLICY="ACCEPT"

Now we need to enable the firewall (by default it’s disabled) and add some additional rules to it:

# Enable the firewall
sudo ufw enable
# Allow everything on my LAN to connect to anything
sudo ufw allow from 192.168.0.0/16
# Allow Protocol-41 connections from the Tunnel Endpoint Server (to run the tunnel)
sudo ufw allow from [Your "Server IPv4 Address"] proto ipv6
# Allow BOOTP service on port 67 from radvd
sudo ufw allow proto any to any port 67
# Allow my IPv6 addresses to access services on this server
sudo ufw allow from [Your "Routed /64 IPv6 Prefix" including the "/64"]

I also had to add a few more rules to cope with the external facing services that my home server provides to the Internet (mail, web, ssh, ftp, vpn etc).

Finally I want to prevent all but a few specific types of external IPv6 connection to be made inbound into my network. To do this, edit the file /etc/ufw/before6.rules, and add the following lines directly BEFORE the “COMMIT” statement at the end of the file:


# Forward IPv6 packets associated with an established connection
-A ufw6-before-forward -i he-ipv6 -m state --state RELATED,ESTABLISHED -j ACCEPT

# Allow "No Next Header" to be forwarded or proto=59
# See http://www.ietf.org/rfc/rfc1883.txt (not sure if the length
# is needed as all IPv6 headers should be that size anyway).
-A ufw6-before-forward -p ipv6-nonxt -m length --length 40 -j ACCEPT

# allow MULTICAST to be forwarded
# These 2 need to be open to enable Auto-Discovery.
-A ufw6-before-forward -p icmpv6 -s ff00::/8 -j ACCEPT
-A ufw6-before-forward -p icmpv6 -d ff00::/8 -j ACCEPT

# ok icmp codes to forward
-A ufw6-before-forward -p icmpv6 --icmpv6-type destination-unreachable -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type time-exceeded -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type parameter-problem -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type echo-request -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type echo-reply -j ACCEPT

# Don't forward any other packets to hosts behind this router.
-A ufw6-before-forward -i he-ipv6 -j ufw6-logging-deny
-A ufw6-before-forward -i he-ipv6 -j DROP

At this point I saved everything and rebooted (though you could just bring up the he-ipv6 interface) and everything came up correctly. I was able to test that I had a valid Global scope IPv6 address associated with (in my case) my br0 interface, and that I could successfully ping6 -c 5 ipv6.google.com via it. I was also able to check that my laptop had automatically found and configured a valid Global scope IPv6 address for it’s eth0 interface, and that it could ping6 my home server and external IPv6 sites, and that it was possible to browse IPv6-only websites from it.

Ditching the spinning rust

For some time now I’ve been thinking of switching my laptop storage over to an SSD. I like the idea of the massively improved performance, the slightly reduced power consumption, and the ability to better withstand the abuse of commuting. However, I don’t like the limited write cycles, or (since I need a reasonable size drive to hold all the data I’ve accumulated over the years) the massive price-premium over traditional drives. So I’ve been playing a waiting game over the last couple of years, and watching the technology develop.

But as the January sales started, I noticed the prices of 256GB SSDs have dipped to the point where I’m happy to “invest”. So I’ve picked up a Samsung 840 EVO 250GB SSD for my X201 Thinkpad; it’s essentially a mid-range SSD at a budget price-point, and should transform my laptops performance.

SSD’s are very different beasts from traditional hard drives, and from reading around the Internet there appear to be several things that I should take into account if I want to obtain and then maintain the best performance from it. Predominant amongst these are ensuring the correct alignment of partitions on the SSD, ensuring proper support for the Trim command, and selecting the best file system for my needs.

But this laptop is supplied to me by my employer, and must have full system encryption implemented on it. I can achieve this using a combination of LUKS and LVM, but it complicates the implementation of things like Trim support. The disk is divided into a minimal unencrypted boot partition with the majority of the space turned into a LUKS-encrypted block device. That is then used to create an LVM logical volume, from which are allocated the partitions for the actual Linux install.

Clearly once I started looking at partition alignment and different filesystem types a reinstall becomes the simplest option, and the need for Trim support predicates fairly recent versions of LUKS and LVM, driving me to a more recent distribution than my current Mint 14.1, which is getting rather old now. This gives me the opportunity to upgrade and fine-tune my install to better suit the new SSD. I did consider moving to the latest Mint 16, but my experiences with Mint have been quite mixed. I like their desktop environment very much, but am much less pleased with other aspects of the distribution, so I think I’ll switch back to the latest Ubuntu, but using the Cinnamon desktop environment from Mint; the best of all worlds for me.

Partition alignment

This article describes why there is a problem with modern drives that use 4k sectors internally, but represent themselves as having a 512byte sector externally. The problem is actually magnified with SSD’s where this can cause significant issues with excessive wearing of the cells. Worse still, modern SSDs like my Samsung write in 4K pages, but erase in 1M blocks of 256 pages. It means that partitions need to be aligned not to “just” 4K boundries, but to 1MB boundries.

Fortunately this is trivial in a modern Linux distribution; we partition the target drive with a GPT scheme using gdisk; on a new blank disk it will automatically align the partitions to 2048 sector, or 1MB boundries. On disks with existing partitions this can be enabled with the “l 2048” command in the advanced sub-menu, which will force alignment of newly created partitions on 1MB boundries.

Trim support

In the context of SSD’s TRIM is an ATA command that allows the operating system to tell the SSD which sectors are no longer in use, and so can be cleared, ready for rapid reuse. Wikipedia has some good information on it here. The key in my case is going to be to enable the filesystem to issue TRIM commands, and then enabling the LVM and LUKS containers that hold the filesystem to pass the TRIM commands on through to the actual SSD. There is more information on how to achieve this here.

However, there are significant questions over whether it is best to enable TRIM on the fstab options, getting the filesystem to issue TRIM commands automatically as it deletes sectors, or periodically running the user space command fstrim using something like a cron job or an init script. Both approaches still have scenarios that could result in significant performance degradation. At the moment I’m tending towards using fstrim in some fashion, but I need to do more research before making a final decision on this.

File system choice

Fundamentally I need a filesystem that supports the TRIM command – not all do. But beyond that I would expect any filesystem to perform better on an SSD than it does on a hard drive, which is good.

However, as you would expect, different filesystems have different strengths and weaknesses so by knowing my typical usage patterns I can select the “best” of the available filesystems for my system. And interestingly, according to these benchmarks, the LUKS/LVM containers that I will be forced to use can have a much more significant affect on some filesystems (particularly the almost default ext4) than others.

So based on my reading of these benchmarks and the type of use that I typically make of my machine, my current thought is to run an Ubuntu 13.10 install on BTRFS filesystems with lzo compression for both my root and home partitions, both hosted in a single LUKS/LVM container. My boot partition will be a totally separate ext3 partition.

The slight concern with that choice is that BTRFS is still considered “beta” code, and still under heavy development. It is not currently the default on any major distribution, but it is offered as an installation choice on almost all. The advanced management capabilities such as on-the-fly compression, de-duplication, snapshots etc make it very attractive though, and ultimately unless people like me do adopt it, it will never become the default filesystem.

I’ll be implementing a robust backup plan though!

Program a better windvane

Back in August I wrote about my weather station, some of the problems I’d experienced with it, and what I’d done to fix them. The one thing that I’d not been able to quickly solve was the lack of damping on the wind vane, which meant it was difficult to accurately track the wind direction.

Having done some research on the web, it seems that everyone has this problem with the wind vane; it’s fundamentally a bad design. Some people have tried modifying them, usually by adding a much larger tail-piece, which then needs a larger nose-cone to counterbalance it. It usually also means that the unit needs to be remounted to avoid the wind vane colliding with the anemometer.

For a while I toyed with the idea of following this route, and redesigning the wind vane. However, I could see that I would be signing myself up for a lot of messing around at the top of a ladder, and winter is very fast approaching. Not a terribly attractive option.

Meanwhile I’ve been rewriting the software that I use to capture the data from the weather station, before I send it on to my PWS on Weather Underground. The software had a couple of little bugs that I wanted to resolve, and lacked some functions that I wanted to add. So I wondered if I could do something about damping the wind vane in that software. It turns out that I can. Sort of.

The way the weather station appears to work, is that it has 4080 weather records in the console, that act as a circular buffer holding the weather history. By default, the console “creates” a new historical record every 30 minutes (giving an 85 day history) though this can be altered with software. The weather sensors however, are read at a fixed interval of about every 50 seconds, and are apparently always written to the current record. So with the default configuration the console only records the last of about 36 sets of readings.

However, by connecting to the console via USB, it’s possible to capture some of those intermediate readings, which allows us to do something helpful. In my case, I read the sensor data from the console’s current weather record every minute, creating a running average of the last “n” wind direction readings, before uploading it to Weather Underground. At the moment, n=10, which produces a significant reduction in the extremes of the readings.

Of course, this isn’t really damping the wind vane. Rather, it’s mathematically manipulating all the data points I can see (some of which I know will be inaccurate due to the sensor design) and removing the more extreme values from the set that I process. So we’re actually losing data here. But the proof is in the pudding, and the results seem to track more expensive weather station designs more accurately.

You can see this in the following series of images. This first one is an example plot of a day of raw wind direction data from my weatherstation:

Graph of undamped raw wind direction

This is a plot of the wind direction data from a different day, using a high quality weather station (a Davis Vantage Pro 2):

Graph of wind direction from a Davis Vantage Pro 2

And this is a plot of the wind direction on the same day, using my weather station, but with the damping function enabled:

Graph of wind direction from my weather station, after damping

Ok, it’s not perfect, but it’s a lot better than it was. And I know that the mounting location of the Davis Vantage Pro 2 sensors is much better than mine, so I’m unlikely to ever get results as good as the Davis set anyway.

For anyone interested in the damping, I create an array of historical wind direction data. I then take each element of that array in turn, and convert it into unit vectors for X and Y components of the angle. I then average the X and Y vectors, before turning the result back into an angle. By sampling frequently, and modifying the length of the history buffer, it is possible to significantly reduce the amount of “noise” from the sensor, and produce a much better track from the sensor data.

If that sounds too complicated, you have a Fine Offset 1080 or 1081 weatherstation such as the Maplin N96FY, and just want to get similar results to me, then you will soon be able to find all the code and instructions on how to use it here.

Irish backup: to be sure, to be sure …

As I centralise more and more digital content on my home server, my need for a decent backup strategy has dramatically increased. I now keep photographs, emails, music, video and the backups from all my other systems on that server, and losing that data would be a catastrophe.

Initially I ran my server with a RAID5 array, simply to achieve a large enough disk store at a reasonable cost. This gave me a little protection against an individual disk failure, even though I had no simple way to take cost-effective backups. However, technology moves rapidly, and I’ve now moved away from the complexity of a RAID array to a large single disk. However, should that disk fail, my entire server would fail, and all my data would be lost.

Initially I tried to mitigate this risk by using a second large hard drive in a USB caddy as a backup medium. Initially I simply plugged the drive in once a week, and manually copied over all the files I wanted to take a backup of. However, with nearly 1.5TB of data, it took hours to complete (not terribly practical) and to make matters worse, I kept forgetting to run the backup. Which was hopeless.

What I needed was something that took incremental backups – ie, only copy the things that had changed. So I decided to use Duplicity, since it was installed as the default backup program on Ubuntu. It first takes a complete initial backup, and then only captures the changes to any of those files. Which sounds great, but problems became apparent immediately; my initial backup took 6 days (yes, days) to complete. I hoped that having got that done, subsequent incremental backups would be significantly faster. And to be fair, the next backup was; but it still took nearly 4 days to complete.

This wasn’t working out well; my server was spending all it’s time (trying) to run backups (none of which were very chronologically consistent), and because my backup device was running all that time, it was aging (and therefore as susceptible to failure) as the main drive. Worse, rather than forgetting to start my backups, I was also now actively dissuaded from starting them. Overall, I would be as well off simply running the two drives in a RAID1 configuration and accepting the complexity and failure rates.

So after some research I switched to rsnapshot, which uses rsync and hard links to create a series of snapshots (as the name suggests) where there is only ever one instance of a given version of any file in the backup. This is exactly the same approach that Apple take with their Time Machine product. It both saves an enormous amount of space on the backup device, and is relatively fast (very fast compared to Duplicity!) in operation, taking only 30 minutes or so to process my 1.5TB.

At the same time, I have installed the disk that was in the USB caddy into my server as a second drive. This means I don’t have to remember to connect it to my server and power it up, and makes it easy to automate the backup process with a script and cron. However, it put me back in the situation where both the main and backup drives were spinning all the time, aging together, potentially failing close to one another. The solution to this is to use some fairly low level disk utilities to spin down the backup drive, and then mark it offline so it cannot be accidentally spun back up again. The backup script brings the disk back online and mounts it (spinning it up) prior to starting a backup, and then spins it down and takes it offline again afterwards.

For the curious, the commands to do that are (as root):
/bin/echo "running" > /sys/block/sdb/device/state
/bin/mount /dev/sdb1 /media/backup

and
/bin/umount /dev/sdb1
/sbin/hdparm -Y /dev/sdb
/bin/echo "offline" > /sys/block/sdb/device/state

I’ve also segregated the data I want to backup into two lists – stuff that changes a lot and needs frequent backups, and the rest. I then take two sets of backups; every day I take a “frequent” backup, and keep those for 14 days. Then in addition, once a week I backup everything and keep a rotation of 4 of those weekly backups, which feed into a monthly rotation of backups that are then kept for 12 months. So eventually I will have backups of (only) my frequently changing data covering the last 14 days, plus backups of everything, made on the last four Sundays, plus a further 12 backups of everything, made on the first Sunday of every month, for the last year. 30 backups in total.

The drawback (of course) is that this only protects me against hardware failure. There is no off-machine or off-site backup involved in this. So if my machine were to catch fire, it would be game over. However, if I look critically at the data, I could in extremis either stand the loss of the data or (with enough effort or money) reproduce everything except the photographs. So we also keep copies of the photographs (and only the photographs) in 2 separate cloud services, because as the title says, I want “to be sure, to be sure” 😉