Fixing a bad block in an ext4 partition on an advanced format SATA drive

I recently got an email from my home server, warning me that it had detected an error on one of its hard drives. This was automatically generated by smartd, part of smartmontools, which monitors the health of the disk storage attached to my server by running a series of regular self-tests without my intervention.

To find out exactly what had been found, I used the smartctl command to see the logged results of the last few self-tests. As you can see, the daily Short Offline tests were all passing, but the long-running weekly Extended Offline tests were showing a problem with the same LBA on each failing run, namely LBA 1318075984:


# smartctl -l xselftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%             9559  -
# 2  Short offline       Completed without error        00%             9535  -
# 3  Short offline       Completed without error        00%             9511  -
# 4  Extended offline    Completed: read failure        70%             9490  1318075984
# 5  Short offline       Completed without error        00%             9487  -
# 6  Short offline       Completed without error        00%             9463  -
# 7  Short offline       Completed without error        00%             9440  -
# 8  Short offline       Completed without error        00%             9416  -
# 9  Short offline       Completed without error        00%             9392  -
#10  Short offline       Completed without error        00%             9368  -
#11  Short offline       Completed without error        00%             9344  -
#12  Extended offline    Completed: read failure        70%             9322  1318075984
#13  Short offline       Completed without error        00%             9320  -
#14  Short offline       Completed without error        00%             9296  -
#15  Extended offline    Completed without error        00%             9204  -
#16  Short offline       Completed without error        00%             9198  -
#17  Short offline       Completed without error        00%             9176  -
#18  Short offline       Completed without error        00%             9152  -

The fact that this is a “read failure” probably means that this is a medium error. That can usually be resolved by writing fresh data to the block. This will either succeed (in the case of a transient problem), or cause the drive to reallocate a spare block to replace the now-failed block. The problem, of course, is that that block might be part of some important piece of data. Fortunately I have backups. But I’d prefer to restore only the damaged file, rather than the whole disk. The rest of this post discusses how to achieve that.

Firstly we need to look at the disk layout to determine what partition the affected block falls within:


# gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 453C41A1-848D-45CA-AC5C-FC3FE68E8280
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2157 sectors (1.1 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
     1            2048         1050623  512.0 MiB  EF00
     2         1050624         4956159  1.9 GiB    8300
     3         4956160        44017663  18.6 GiB   8300
     4        44017664        52219903  3.9 GiB    8200
     5        52219904        61984767  4.7 GiB    8300
     6        61984768       940890111  419.1 GiB  8300
     7       940890112      3907028991  1.4 TiB    8300

We can see that the disk uses logical 512 byte sectors, and that the failing sector is in partition /dev/sda7 (which spans sectors 940890112 to 3907028991). We also want to know (for later) the physical block size for this disk, which can be found by:


# cat /sys/block/sda/queue/physical_block_size
4096

Since this is larger than the LBA size (of 512 bytes) it means that it’s actually the physical block that contains LBA 1318075984 that is failing, and therefore so will be all the other LBAs in that physical block. In this case, that means 8 LBAs. Because of the way the SMART selftests work, it’s likely that 1318075984 and the following 7 will be failing, but we can test that later.
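The range of LBAs that share a physical block can be worked out with a little shell arithmetic. This is only a sketch: the function name phys_block_range is my own, and it assumes that physical blocks are aligned to LBA 0, which is usual for advanced format drives but not guaranteed.

```shell
# Print the first and last LBA of the physical block containing a given LBA.
# Assumes physical blocks start at LBA 0 (no alignment offset).
phys_block_range() {
    lba=$1 logical=$2 physical=$3
    per_block=$(( physical / logical ))     # LBAs per physical block
    first=$(( lba - lba % per_block ))      # round down to the block start
    echo "$first $(( first + per_block - 1 ))"
}

phys_block_range 1318075984 512 4096
```

For the failing LBA this reports 1318075984 as the first LBA of the block and 1318075991 as the last.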

Next we need to understand what filesystem that partition has been formatted with. I happen to know that all my partitions are formatted as ext4 on this system, but you could find this information in the /etc/fstab configuration file.

The rest of this post is only directly relevant to ext4/3/2 filesystems. Feel free to use the general process, but please look elsewhere for detailed instructions for BTRFS, XFS, etc.

The next thing to do is to determine the offset of the failing LBA into the sda7 partition. So, 1318075984 – 940890112, which is 377185872 sectors of 512 bytes. We now need to know how many filesystem blocks that is, so let’s find out what block size that partition is using:


# tune2fs -l /dev/sda7 | grep Block
Block count: 370767360
Block size: 4096
Blocks per group: 32768

So, each filesystem block is 4096 bytes. To determine the offset of the failing LBA in the filesystem, we divide the LBA offset into the filesystem by 8 (4096/512), giving us a filesystem offset of 47148234. Since this is an exact result, we know it happens to be the first logical LBA in that filesystem block that is causing the error (as we expected).
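The same calculation can be scripted so that it is easy to repeat for another LBA or partition. A minimal sketch, with a function name (lba_to_fs_block) of my own choosing; the inputs are the failing LBA, the partition start sector from gdisk, the logical sector size and the filesystem block size from tune2fs:

```shell
# Map a whole-disk LBA to a filesystem block number within a partition.
# Prints "<filesystem block> <index of the LBA within that block>".
lba_to_fs_block() {
    lba=$1 part_start=$2 sector_size=$3 fs_block_size=$4
    offset=$(( lba - part_start ))                  # sectors into the partition
    per_block=$(( fs_block_size / sector_size ))    # sectors per filesystem block
    echo "$(( offset / per_block )) $(( offset % per_block ))"
}

lba_to_fs_block 1318075984 940890112 512 4096
```

With the values from this disk it prints 47148234 followed by 0; the zero remainder confirms that the failing LBA is the first one in its filesystem block.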

Next we want to know if that block is in use, or part of the filesystem’s free space:


# debugfs /dev/sda7
debugfs 1.42.9 (4-Feb-2014)
debugfs: testb 47148234
Block 47148234 marked in use

So we know that filesystem block is part of a file – unfortunately. The question is which one?


debugfs: icheck 47148234
Block Inode number
47148234 123993
debugfs: ncheck 123993
Inode Pathname
123993 /media/homevideo/AA-20140806.mp4

Since the filesystem block size and the physical disk block size are the same, I could just assume that that’s the only block affected. But that’s probably not very wise. So let’s check the physical blocks (on the disk) before and after the one we know is failing by asking for the failing LBA plus and minus 8 LBAs:


# # The reported failing LBA:
# dd if=/dev/sda of=sector.bytes skip=1318075984 bs=512 count=1
dd: error reading ‘/dev/sda’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 2.85748 s, 0.0 kB/s
# # The reported failing LBA - 8:
# dd if=/dev/sda of=sector.bytes skip=1318075976 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0246482 s, 20.8 kB/s
# # The reported failing LBA + 8:
# dd if=/dev/sda of=sector.bytes skip=1318075992 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0246482 s, 20.8 kB/s

So, in this case we can see that the physical blocks before and after the failing disk block are both currently readable, meaning that we only need to deal with the single failing block.
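To check every LBA in a range rather than three spot samples, the dd test can be wrapped in a loop. This is a sketch rather than a polished tool: probe_lbas is a name I made up, it assumes 512 byte logical sectors, and it relies only on the exit status of dd. The example run uses a scratch image file so that it is safe to try.

```shell
# Try to read each LBA in a range and report which ones fail.
# usage: probe_lbas <device-or-image> <first LBA> <count>
probe_lbas() {
    dev=$1 first=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        lba=$(( first + i ))
        if dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "$lba ok"
        else
            echo "$lba FAILED"
        fi
        i=$(( i + 1 ))
    done
}

# Safe demonstration against a 32-sector scratch image:
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=32 2>/dev/null
probe_lbas "$img" 8 8
rm -f "$img"
```

On the real disk, something like probe_lbas /dev/sda 1318075976 24 (as root) would cover the physical blocks before, at and after the failure. Reads of a real device may be served from the page cache, so adding iflag=direct to the dd command (a GNU dd option) is worth considering there.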

Since I have a backup of the file that this failing block occurs in, we’ll complete the resolution of the problem by overwriting the failing physical sector with zeros (definitely corrupting the containing file), which triggers the drive’s block-level reallocation routines, and then deleting the file, prior to recovering it from backup:


# # Note seek= (not skip=), so that the offset applies to the output device:
# dd if=/dev/zero of=/dev/sda seek=1318075984 bs=512 count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 0.000756974 s, 5.4 MB/s
# rm /media/homevideo/AA-20140806.mp4
# dd if=/dev/sda of=sector.bytes skip=1318075984 bs=512 count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 0.000857198 s, 4.8 MB/s

At this point I could run an immediate Extended Offline self-test, but I’m confident that as I can now successfully read the originally-failing block, the problem is solved, and I’ll just wait for the normally scheduled self-tests to be run by smartd again.

Update: I’ve experienced a situation where overwriting the failing physical sector with zeros using dd failed to trigger the drive’s automatic block reallocation routines. However, in that case I was able to resolve the situation by using hdparm instead. Use:

hdparm --read-sector 1318075984 /dev/sda

to (try to) read and display a block of data, and

hdparm --write-sector 1318075984 --yes-i-know-what-i-am-doing /dev/sda

to overwrite the block with zeros. Both these commands use the drive’s logical block size (normally 512 bytes), not the 4K physical sector size.

Updated weatherstation software

I’ve been enhancing the software that I use to read data from my weatherstation. It’s been working well since I added some extra code to detect when the sensor readings are obviously corrupt (radio interference, uncontrolled concurrent memory accesses, etc), but the tracking of the wind direction was still not quite good enough.

To improve that, I’ve extended the number of readings used in the running average, and enhanced the algorithms that average the wind direction to take into account not only the direction of each data point (unit vectors), but also the speed of the wind in that direction (true vectors).
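As an illustration of the idea (not the code from my software), a speed-weighted average can be sketched with awk. Each reading is treated as a vector whose length is the wind speed, the components are summed, and the result is turned back into an angle. The function name and the input format (one "direction speed" pair per line) are assumptions made for this example.

```shell
# Average wind direction, weighting each reading by its wind speed.
# Input: one "direction_degrees speed" pair per line on stdin.
weighted_wind_dir() {
    awk '
        BEGIN { pi = atan2(0, -1) }
        {
            rad = $1 * pi / 180
            x += $2 * sin(rad)      # east-west component
            y += $2 * cos(rad)      # north-south component
        }
        END {
            deg = atan2(x, y) * 180 / pi
            if (deg < 0) deg += 360
            printf "%.0f\n", deg
        }'
}

printf '%s\n' "350 10" "10 10" "90 1" | weighted_wind_dir
```

In that example the two strong readings either side of north dominate, and the single light easterly reading only nudges the result slightly east of north. A reading taken in calm conditions (speed 0) contributes nothing at all.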

The result seems to track significantly more accurately to nearby high quality stations, but I am conscious that this is still presenting a manipulation of the data, rather than the actual data that a better sensor would provide. Having said that, it’s now producing a pretty good result for hardware that cost me only £50.

The software can be downloaded from here.

Playing with callerid again

Back in April last year I started playing with an old external serial-attached modem to read the callerid of incoming calls. My intention was to intercept calls from direct marketeers. The concept was good, but I ran into problems with the modem; it took up a lot of space, kept overheating, and lacked any voice facilities, limiting what I could do with it. In addition (probably because of the modem constantly overheating) the software I was running kept crashing.

So in the end, I gave up on the idea.

But recently we seem to have had a spate of annoying calls from direct marketeers based in India, selling products for UK companies that are cynically avoiding the UK’s regulations around direct marketing opt-outs. The straw that broke the camel’s back was the call that came through at 6am on a Saturday morning.

The problem here is that the phone companies don’t care about this. They make money from all these calls, so it’s not in their interest to block them. They’ll sell me a service to block “withheld” numbers, but not numbers that show as “unavailable”. Unfortunately, these days the majority of the problem calls come from India, and show as “unavailable” because the Indian call centers are using VoIP systems to place their calls to the UK, and they deliberately ensure that their callerid is “unavailable”.

So I’m back on the idea of making my own solution to this problem. So first off, I purchased a USR 5637 USB fax modem that is compatible with the UK callerid protocols. Even better, this is a voice modem too, so it can interact with a caller, sending audio as well as data down the phone line, and recognise touchtones. It’s also small, self-powered, cool-running and very reliable.

Next I spent some time looking to see what other people have done in this space, and eventually found this webpage, that describes a simple Bash script that intercepts calls that are not on a whitelist, and plays modem tones to them before hanging up. Recognised callers are allowed to ring the phone as normal, progressing to an answerphone if necessary. It’s not exactly the functionality that I want, but the simplicity is beguiling, and it’s trivial to extend it to do something closer to what I do want. And anyway, anything more sophisticated is going to require something like Asterisk, and switching my whole phone system over to VoIP, which is not going to be very family-friendly.

So for now, I’m gathering lists of all incoming calls to establish a basic whitelist, before moving on to do some really basic screening of calls.
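The core of the whitelist check is small enough to sketch here. This is not the script from the webpage mentioned above, just an illustration of the decision it has to make; the function name, the whitelist file format (one number per line) and the "ring"/"intercept" results are all my own assumptions.

```shell
# Decide what to do with an incoming call, given its caller ID and a
# whitelist file holding one permitted number per line.
screen_call() {
    number=$1 whitelist=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    if [ -n "$number" ] && grep -Fxq -- "$number" "$whitelist"; then
        echo "ring"
    else
        echo "intercept"
    fi
}
```

Anything that is not an exact match, including calls with no usable number at all, falls through to "intercept".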

Plusnet IPv6 still delayed, so let’s go spelunking in a Hurricane Electric tunnel

When I last changed ISP (last year, when Sky bought my old ISP, BE Unlimited) one of my requirements was for my new ISP to have a roadmap for making IPv6 available. I’ve been waiting (impatiently) for my new ISP, Plusnet, to deliver on their initial “coming soon” statements. Sadly, like almost all ISPs, Plusnet are not moving quickly with IPv6, and are still investigating alternatives like carrier grade NAT to extend the life of IPv4. I can sympathise with this position – IPv6 has limited device support, most of their customers are not ready to adopt it, and trying to provide support for the necessary dual-stack environment would not be easy. But, the problem of IPv4 address exhaustion is not going away.

So at the end of last year they started their second controlled trial of IPv6. I was keen to join, but the conditions were onerous. I would get a second account, I would need to provide my own IPv6-capable router, I couldn’t have my existing IPv4 static IP address, I couldn’t have Reverse DNS on the line, and I had to commit to providing feedback on my “normal” workload. So much as I wanted to join the trial, I couldn’t, as I wouldn’t be able to run my mailserver.

So I decided to investigate alternatives until such time as Plusnet get native IPv6 support working. The default solution in cases like mine, where my ISP only provides me with an IPv4 connection, is to tunnel IPv6 conversations through my IPv4 connection, to an ISP who does provide IPv6 connectivity to the Internet. There are two major players in this area for home users, SixXS and Hurricane Electric. Both provide all the basic functionality, as well as each having some individual specialist features. I’m just looking for a basic IPv6 connection and could use either, but in the end Hurricane Electric appeared vastly easier to register with, so I went with them.

My current internet connection is FTTC (fibre to the cabinet) via a BT OpenReach VDSL2 modem and my ISP-supplied (cheap and nasty) combined broadband router, NAT and firewall. This gives me a private 16bit IPv4 address space, for which my home server (a low-power mini-ITX system that runs 24×7) provides all the network management functions, such as DHCP and DNS.

What I want to add to this is a protocol-41 tunnel from the IPv6 ISP (Hurricane Electric, or HE) back through my NAT & Firewall to my home server. By registering for such a tunnel, HE provide (for free) a personal /64 subnet to me through that tunnel, allowing me to use my home server to provision IPv6 addresses to all the devices on my LAN. However, this connection is neither NAT’ed nor firewalled. The IPv6 addresses are both globally addressable and visible. So I also want my home server to act as a firewall for all IPv6 communications through that tunnel, to protect the devices on my network, without forcing them to all adopt their own firewalls. I was initially concerned that because my home server also acts as an OpenVPN endpoint, and so uses a bridged network configuration, getting the tunnel working might be quite awkward, but it turns out to make very little difference to the solution.

So, to make this work, first you need a static IPv4 address on the internet, and to have ensured that your router will respond to ICMP requests (pings!). Then you can register with Hurricane Electric, and “Create a Regular Tunnel”, which will result in a page of information describing your tunnel. I printed this for future reference (and in case I broke my network while making changes) but you can always access this information from the HE website.

You now need to edit /etc/network/interfaces. Add lines to define the tunnel, as follows, substituting the values from your tunnel description:

# Define 6in4 ipv6 tunnel to Hurricane Electric
auto he-ipv6
iface he-ipv6 inet6 v4tunnel
    address [your "Client IPv6 Address"]
    netmask 64
    endpoint [your "Server IPv4 Address"]
    ttl 255
    up ip -6 route add default dev he-ipv6
    down ip -6 route del default dev he-ipv6

Now add an address from your “Routed /64 IPv6 Prefix” to the appropriate interface – in my case, this is the bridge interface br0, but it’s more likely to be eth0 for you. This defines your server’s static, globally accessible IPv6 address:

# Add an IPv6 address from the routed prefix to the br0 interface.
iface br0 inet6 static
    address [any IPv6 address from the Routed /64 IPv6 Prefix]
    netmask 64

Since I am running Ubuntu 12.04 I now need to install radvd, which will advertise the IPv6 subnet to any systems on our network that want to configure themselves an IPv6 connection. Think of it as a sort of DHCP for IPv6. However, when I move to 14.04 sometime later this year I expect to be able to get rid of radvd, and replace it with dnsmasq (which I already use for IPv4 DNS/DHCP), as the latest version of dnsmasq is reported to provide a superset of the radvd capabilities.

sudo apt-get update
sudo apt-get install radvd

Then configure radvd to give out IPv6 addresses from our Routed /64 IPv6 Prefix, by creating the file /etc/radvd.conf, and entering the following into it:

interface [your interface, probably eth0]
{
    AdvSendAdvert on;
    AdvLinkMTU 1480;
    prefix [Your Routed /64 IPv6 Prefix, incl the /64]
    {
        AdvOnLink on;
        AdvAutonomous on;
    };
};

Any IPv6-capable devices will now see the advertised prefix and configure themselves an IPv6 address in your Routed /64 subnet, typically based on the MAC address of the interface that is being configured.
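
To show how a MAC address ends up in an IPv6 address, here is a sketch of the classic "modified EUI-64" derivation: ff:fe is inserted in the middle of the MAC and one bit of the first octet is flipped. The function name is my own, and many operating systems now use privacy or stable-privacy addresses instead, so a device's actual address may well not follow this pattern.

```shell
# Derive the modified EUI-64 interface identifier (the low 64 bits of an
# autoconfigured address) from a MAC address written as aa:bb:cc:dd:ee:ff.
mac_to_eui64() (
    IFS=:
    set -- $1                                       # split into six octets
    first=$(printf '%02x' "$(( 0x$1 ^ 0x02 ))")     # flip the universal/local bit
    printf '%s%s:%sff:fe%s:%s%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
)

mac_to_eui64 00:11:22:33:44:55
```

For the example MAC this prints 0211:22ff:fe33:4455, which is appended to the routed /64 prefix to form the full address.
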
Now uncomment the line:

# net.ipv6.conf.all.forwarding=1

from the file /etc/sysctl.conf. This will allow your server to act as a router for IPv6 traffic.

Now we need to enable and then configure the firewall. I take no credit for this, as much of the information related to the firewall was gleaned from this post. As I run Ubuntu Server I’ll use ufw, the Ubuntu Firewall utility, to configure the underlying iptables firewall. Alternative front-ends to iptables will work equally well, though the actual method of configuration will obviously differ. First I needed to enable the firewall for IPv6 by editing /etc/default/ufw, and ensuring the following options are set correctly:

# Set to yes to apply rules to support IPv6 (no means only IPv6 on loopback
# accepted). You will need to 'disable' and then 'enable' the firewall for
# the changes to take affect.
IPV6=yes

and

# Set the default forward policy to ACCEPT, DROP or REJECT. Please note that
# if you change this you will most likely want to adjust your rules
DEFAULT_FORWARD_POLICY="ACCEPT"

Now we need to enable the firewall (by default it’s disabled) and add some additional rules to it:

# Enable the firewall
sudo ufw enable
# Allow everything on my LAN to connect to anything
sudo ufw allow from 192.168.0.0/16
# Allow Protocol-41 connections from the Tunnel Endpoint Server (to run the tunnel)
sudo ufw allow from [Your "Server IPv4 Address"] proto ipv6
# Allow BOOTP service on port 67 from radvd
sudo ufw allow proto any to any port 67
# Allow my IPv6 addresses to access services on this server
sudo ufw allow from [Your "Routed /64 IPv6 Prefix" including the "/64"]

I also had to add a few more rules to cope with the external facing services that my home server provides to the Internet (mail, web, ssh, ftp, vpn etc).

Finally I want to prevent all but a few specific types of external IPv6 connection from being made inbound into my network. To do this, edit the file /etc/ufw/before6.rules, and add the following lines directly BEFORE the “COMMIT” statement at the end of the file:


# Forward IPv6 packets associated with an established connection
-A ufw6-before-forward -i he-ipv6 -m state --state RELATED,ESTABLISHED -j ACCEPT

# Allow "No Next Header" to be forwarded or proto=59
# See http://www.ietf.org/rfc/rfc1883.txt (not sure if the length
# is needed as all IPv6 headers should be that size anyway).
-A ufw6-before-forward -p ipv6-nonxt -m length --length 40 -j ACCEPT

# allow MULTICAST to be forwarded
# These 2 need to be open to enable Auto-Discovery.
-A ufw6-before-forward -p icmpv6 -s ff00::/8 -j ACCEPT
-A ufw6-before-forward -p icmpv6 -d ff00::/8 -j ACCEPT

# ok icmp codes to forward
-A ufw6-before-forward -p icmpv6 --icmpv6-type destination-unreachable -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type time-exceeded -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type parameter-problem -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type echo-request -j ACCEPT
-A ufw6-before-forward -p icmpv6 --icmpv6-type echo-reply -j ACCEPT

# Don't forward any other packets to hosts behind this router.
-A ufw6-before-forward -i he-ipv6 -j ufw6-logging-deny
-A ufw6-before-forward -i he-ipv6 -j DROP

At this point I saved everything and rebooted (though you could just bring up the he-ipv6 interface) and everything came up correctly. I was able to test that I had a valid Global scope IPv6 address associated with (in my case) my br0 interface, and that I could successfully ping6 -c 5 ipv6.google.com via it. I was also able to check that my laptop had automatically found and configured a valid Global scope IPv6 address for its eth0 interface, and that it could ping6 my home server and external IPv6 sites, and that it was possible to browse IPv6-only websites from it.

Ditching the spinning rust

For some time now I’ve been thinking of switching my laptop storage over to an SSD. I like the idea of the massively improved performance, the slightly reduced power consumption, and the ability to better withstand the abuse of commuting. However, I don’t like the limited write cycles, or (since I need a reasonable size drive to hold all the data I’ve accumulated over the years) the massive price-premium over traditional drives. So I’ve been playing a waiting game over the last couple of years, and watching the technology develop.

But as the January sales started, I noticed the prices of 256GB SSDs have dipped to the point where I’m happy to “invest”. So I’ve picked up a Samsung 840 EVO 250GB SSD for my X201 Thinkpad; it’s essentially a mid-range SSD at a budget price-point, and should transform my laptop’s performance.

SSDs are very different beasts from traditional hard drives, and from reading around the Internet there appear to be several things that I should take into account if I want to obtain and then maintain the best performance from mine. Predominant amongst these are ensuring the correct alignment of partitions on the SSD, ensuring proper support for the Trim command, and selecting the best file system for my needs.

But this laptop is supplied to me by my employer, and must have full system encryption implemented on it. I can achieve this using a combination of LUKS and LVM, but it complicates the implementation of things like Trim support. The disk is divided into a minimal unencrypted boot partition with the majority of the space turned into a LUKS-encrypted block device. That is then used as an LVM physical volume, from which are allocated the logical volumes for the actual Linux install.

Clearly, once I started looking at partition alignment and different filesystem types, a reinstall became the simplest option, and the need for Trim support requires fairly recent versions of LUKS and LVM, driving me to a more recent distribution than my current Mint 14.1, which is getting rather old now. This gives me the opportunity to upgrade and fine-tune my install to better suit the new SSD. I did consider moving to the latest Mint 16, but my experiences with Mint have been quite mixed. I like their desktop environment very much, but am much less pleased with other aspects of the distribution, so I think I’ll switch back to the latest Ubuntu, but using the Cinnamon desktop environment from Mint; the best of all worlds for me.

Partition alignment

This article describes why there is a problem with modern drives that use 4K sectors internally, but present themselves as having 512 byte sectors externally. The problem is actually magnified with SSDs, where this can cause significant issues with excessive wearing of the cells. Worse still, modern SSDs like my Samsung write in 4K pages, but erase in 1MB blocks of 256 pages. It means that partitions need to be aligned not “just” to 4K boundaries, but to 1MB boundaries.

Fortunately this is trivial in a modern Linux distribution; we partition the target drive with a GPT scheme using gdisk; on a new blank disk it will automatically align the partitions to 2048-sector, or 1MB, boundaries. On disks with existing partitions this can be enabled with the “l 2048” command in the advanced sub-menu, which will force alignment of newly created partitions on 1MB boundaries.
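Checking an existing partition is simple arithmetic: multiply the start sector by the logical sector size and see whether the result is a whole number of megabytes. A minimal sketch, using a function name of my own (is_aligned) and assuming 512 byte logical sectors unless told otherwise:

```shell
# Report whether a partition start sector falls on a 1 MiB boundary.
# The second argument is the logical sector size (default 512 bytes).
is_aligned() {
    start=$1 sector_size=${2:-512}
    if [ $(( start * sector_size % 1048576 )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

is_aligned 2048    # prints "aligned"
is_aligned 63      # prints "misaligned" (the old MS-DOS style start sector)
```

On a running system the start sector of a partition can be read from sysfs, for example /sys/block/sda/sda1/start.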

Trim support

In the context of SSDs, TRIM is an ATA command that allows the operating system to tell the SSD which sectors are no longer in use, and so can be cleared, ready for rapid reuse. Wikipedia has some good information on it here. The key in my case is going to be to enable the filesystem to issue TRIM commands, and then to enable the LVM and LUKS containers that hold the filesystem to pass the TRIM commands on through to the actual SSD. There is more information on how to achieve this here.

However, there are significant questions over whether it is best to enable TRIM in the fstab options (the discard mount option), getting the filesystem to issue TRIM commands automatically as it frees blocks, or to periodically run the user space command fstrim using something like a cron job or an init script. Both approaches still have scenarios that could result in significant performance degradation. At the moment I’m tending towards using fstrim in some fashion, but I need to do more research before making a final decision on this.

File system choice

Fundamentally I need a filesystem that supports the TRIM command – not all do. But beyond that I would expect any filesystem to perform better on an SSD than it does on a hard drive, which is good.

However, as you would expect, different filesystems have different strengths and weaknesses, so by knowing my typical usage patterns I can select the “best” of the available filesystems for my system. And interestingly, according to these benchmarks, the LUKS/LVM containers that I will be forced to use can have a much more significant effect on some filesystems (particularly the almost default ext4) than others.

So based on my reading of these benchmarks and the type of use that I typically make of my machine, my current thought is to run an Ubuntu 13.10 install on BTRFS filesystems with lzo compression for both my root and home partitions, both hosted in a single LUKS/LVM container. My boot partition will be a totally separate ext3 partition.

The slight concern with that choice is that BTRFS is still considered “beta” code, and still under heavy development. It is not currently the default on any major distribution, but it is offered as an installation choice on almost all. The advanced management capabilities such as on-the-fly compression, de-duplication, snapshots etc make it very attractive though, and ultimately unless people like me do adopt it, it will never become the default filesystem.

I’ll be implementing a robust backup plan though!

Program a better windvane

Back in August I wrote about my weather station, some of the problems I’d experienced with it, and what I’d done to fix them. The one thing that I’d not been able to quickly solve was the lack of damping on the wind vane, which meant it was difficult to accurately track the wind direction.

Having done some research on the web, it seems that everyone has this problem with the wind vane; it’s fundamentally a bad design. Some people have tried modifying them, usually by adding a much larger tail-piece, which then needs a larger nose-cone to counterbalance it. It usually also means that the unit needs to be remounted to avoid the wind vane colliding with the anemometer.

For a while I toyed with the idea of following this route, and redesigning the wind vane. However, I could see that I would be signing myself up for a lot of messing around at the top of a ladder, and winter is very fast approaching. Not a terribly attractive option.

Meanwhile I’ve been rewriting the software that I use to capture the data from the weather station, before I send it on to my PWS on Weather Underground. The software had a couple of little bugs that I wanted to resolve, and lacked some functions that I wanted to add. So I wondered if I could do something about damping the wind vane in that software. It turns out that I can. Sort of.

The way the weather station appears to work, is that it has 4080 weather records in the console, that act as a circular buffer holding the weather history. By default, the console “creates” a new historical record every 30 minutes (giving an 85 day history) though this can be altered with software. The weather sensors however, are read at a fixed interval of about every 50 seconds, and are apparently always written to the current record. So with the default configuration the console only records the last of about 36 sets of readings.
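Those figures are easy to check with shell arithmetic. The variable names are mine, and the 50 second sampling interval is the approximate value mentioned above:

```shell
records=4080          # history records held by the console
interval_min=30       # default minutes between new history records
sample_sec=50         # approximate seconds between sensor reads

echo "history depth: $(( records * interval_min / 60 / 24 )) days"
echo "readings per record: $(( interval_min * 60 / sample_sec ))"
```

This gives 85 days of history and 36 sensor readings per stored record, of which only the last survives.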

However, by connecting to the console via USB, it’s possible to capture some of those intermediate readings, which allows us to do something helpful. In my case, I read the sensor data from the console’s current weather record every minute, creating a running average of the last “n” wind direction readings, before uploading it to Weather Underground. At the moment, n=10, which produces a significant reduction in the extremes of the readings.

Of course, this isn’t really damping the wind vane. Rather, it’s mathematically manipulating all the data points I can see (some of which I know will be inaccurate due to the sensor design) and removing the more extreme values from the set that I process. So we’re actually losing data here. But the proof is in the pudding, and the results seem to track more expensive weather station designs more accurately.

You can see this in the following series of images. This first one is an example plot of a day of raw wind direction data from my weatherstation:

Graph of undamped raw wind direction

This is a plot of the wind direction data from a different day, using a high quality weather station (a Davis Vantage Pro 2):

Graph of wind direction from a Davis Vantage Pro 2

And this is a plot of the wind direction on the same day, using my weather station, but with the damping function enabled:

Graph of wind direction from my weather station, after damping

Ok, it’s not perfect, but it’s a lot better than it was. And I know that the mounting location of the Davis Vantage Pro 2 sensors is much better than mine, so I’m unlikely to ever get results as good as the Davis set anyway.

For anyone interested in the damping, I create an array of historical wind direction data. I then take each element of that array in turn, and convert it into unit vectors for X and Y components of the angle. I then average the X and Y vectors, before turning the result back into an angle. By sampling frequently, and modifying the length of the history buffer, it is possible to significantly reduce the amount of “noise” from the sensor, and produce a much better track from the sensor data.
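The reason for going via vectors is that angles wrap around: the plain arithmetic mean of 350 and 20 degrees is 185, which points almost due south, whereas the wind was clearly coming from just east of north. The following is a minimal sketch of the technique using awk, not the code from my software; the function name, the ring buffer layout and the choice of which component is called X are all assumptions made for the example.

```shell
# Running circular mean of the last n wind directions (degrees), one reading
# per line on stdin; prints the damped direction after each reading.
damp_wind_dir() {
    awk -v n="${1:-10}" '
        BEGIN { pi = atan2(0, -1) }
        {
            slot = NR % n                       # reuse slots as a ring buffer
            sx[slot] = sin($1 * pi / 180)       # unit vector, X component
            sy[slot] = cos($1 * pi / 180)       # unit vector, Y component
            x = 0; y = 0
            for (s in sx) { x += sx[s]; y += sy[s] }
            deg = atan2(x, y) * 180 / pi
            if (deg < 0) deg += 360
            printf "%.0f\n", deg
        }'
}

printf '%s\n' 350 20 350 20 | damp_wind_dir 2
```

With a history of two readings, that alternating input settles on 5 degrees after the first reading. Increasing n smooths the track further, at the cost of responding more slowly to genuine changes in wind direction.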

If that sounds too complicated, and you have a Fine Offset 1080 or 1081 weatherstation such as the Maplin N96FY and just want to get similar results to me, then you will soon be able to find all the code and instructions on how to use it here.

Irish backup: to be sure, to be sure …

As I centralise more and more digital content on my home server, my need for a decent backup strategy has dramatically increased. I now keep photographs, emails, music, video and the backups from all my other systems on that server, and losing that data would be a catastrophe.

Initially I ran my server with a RAID5 array, simply to achieve a large enough disk store at a reasonable cost. This gave me a little protection against an individual disk failure, even though I had no simple way to take cost-effective backups. But technology moves rapidly, and I’ve now moved away from the complexity of a RAID array to a large single disk. The catch is that should that disk fail, my entire server would fail, and all my data would be lost.

My first attempt to mitigate this risk was to use a second large hard drive in a USB caddy as a backup medium. I simply plugged the drive in once a week, and manually copied over all the files I wanted to back up. However, with nearly 1.5TB of data, it took hours to complete (not terribly practical) and to make matters worse, I kept forgetting to run the backup. Which was hopeless.

What I needed was something that took incremental backups – i.e., only copying the things that had changed. So I decided to use Duplicity, since it was installed as the default backup program on Ubuntu. It first takes a complete initial backup, and then only captures the changes to any of those files. Which sounds great, but problems became apparent immediately; my initial backup took 6 days (yes, days) to complete. I hoped that having got that done, subsequent incremental backups would be significantly faster. And to be fair, the next backup was; but it still took nearly 4 days to complete.

This wasn’t working out well; my server was spending all its time (trying) to run backups (none of which were very chronologically consistent), and because my backup device was running all that time, it was aging as fast as the main drive (and was therefore just as susceptible to failure). Worse, as well as forgetting to start my backups, I was now actively dissuaded from starting them. Overall, I would be as well off simply running the two drives in a RAID1 configuration and accepting the complexity and failure rates.

So after some research I switched to rsnapshot, which uses rsync and hard links to create a series of snapshots (as the name suggests) where there is only ever one instance of a given version of any file in the backup. This is exactly the same approach that Apple take with their Time Machine product. It both saves an enormous amount of space on the backup device, and is relatively fast (very fast compared to Duplicity!) in operation, taking only 30 minutes or so to process my 1.5TB.
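To show why hard links save so much space, here is a small Python sketch of the general idea. It is not how rsnapshot itself is implemented (that relies on rsync); the function names, the flat directory layout and the byte-for-byte comparison are all simplifying assumptions for the example.

```python
import filecmp
import os
import shutil
import tempfile


def take_snapshot(source_dir, previous_snapshot, new_snapshot):
    """Snapshot a flat directory: link unchanged files, copy new or changed ones."""
    os.makedirs(new_snapshot)
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue  # sub-directories are ignored to keep the sketch short
        old = os.path.join(previous_snapshot, name) if previous_snapshot else None
        dst = os.path.join(new_snapshot, name)
        if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
            os.link(old, dst)  # unchanged: a second name for the same data on disk
        else:
            shutil.copy2(src, dst)  # new or changed: store a fresh copy


def demo():
    """Take two snapshots and report (unchanged file shared, changed file separate)."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "source")
        os.makedirs(source)
        for name, text in (("photo.txt", "unchanged"), ("notes.txt", "version 1")):
            with open(os.path.join(source, name), "w") as handle:
                handle.write(text)
        older = os.path.join(root, "daily.1")
        newer = os.path.join(root, "daily.0")
        take_snapshot(source, None, older)
        with open(os.path.join(source, "notes.txt"), "w") as handle:
            handle.write("version 2, now longer")
        take_snapshot(source, older, newer)
        shared = os.path.samefile(os.path.join(older, "photo.txt"),
                                  os.path.join(newer, "photo.txt"))
        separate = not os.path.samefile(os.path.join(older, "notes.txt"),
                                        os.path.join(newer, "notes.txt"))
        return shared, separate


if __name__ == "__main__":
    print(demo())
```

Each snapshot directory looks like a complete copy of the source, but a file that has not changed between runs occupies disk space only once.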

At the same time, I have installed the disk that was in the USB caddy into my server as a second drive. This means I don’t have to remember to connect it to my server and power it up, and makes it easy to automate the backup process with a script and cron. However, it put me back in the situation where both the main and backup drives were spinning all the time, aging together, potentially failing close to one another. The solution to this is to use some fairly low level disk utilities to spin down the backup drive, and then mark it offline so it cannot be accidentally spun back up again. The backup script brings the disk back online and mounts it (spinning it up) prior to starting a backup, and then spins it down and takes it offline again afterwards.

For the curious, the commands to do that are (as root):
/bin/echo "running" > /sys/block/sdb/device/state
/bin/mount /dev/sdb1 /media/backup

and
/bin/umount /dev/sdb1
/sbin/hdparm -Y /dev/sdb
/bin/echo "offline" > /sys/block/sdb/device/state

I’ve also segregated the data I want to back up into two lists – stuff that changes a lot and needs frequent backups, and the rest. I then take two sets of backups; every day I take a “frequent” backup, and keep those for 14 days. Then in addition, once a week I back up everything and keep a rotation of 4 of those weekly backups, which feed into a monthly rotation of backups that are then kept for 12 months. So eventually I will have backups of (only) my frequently changing data covering the last 14 days, plus backups of everything, made on the last four Sundays, plus a further 12 backups of everything, made on the first Sunday of every month, for the last year. 30 backups in total.
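The total is easy to check. The sketch below simply adds up the three rotations described above; the dictionary keys are labels chosen for the example, not rsnapshot configuration names.

```python
# Number of snapshots kept in each rotation, as described in the text.
RETENTION = {
    "frequent (daily, changing data only)": 14,
    "weekly (everything)": 4,
    "monthly (everything)": 12,
}


def total_snapshots(retention):
    """Return how many snapshots exist once every rotation has filled up."""
    return sum(retention.values())


if __name__ == "__main__":
    print(total_snapshots(RETENTION))  # 14 + 4 + 12 = 30
```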

The drawback (of course) is that this only protects me against hardware failure. There is no off-machine or off-site backup involved in this. So if my machine were to catch fire, it would be game over. However, if I look critically at the data, I could in extremis either stand the loss of the data or (with enough effort or money) reproduce everything except the photographs. So we also keep copies of the photographs (and only the photographs) in 2 separate cloud services, because as the title says, I want “to be sure, to be sure” 😉

Or, you could just look out the window…

For a couple of years now I’ve fancied the idea of installing a weather station at home. I had no specific requirement for it, I simply figured it would be a fun thing to hook up to my home server, allowing me to track the weather over time. I also had some vague thoughts about justifying the purchase by using a combination of the live and historical weather data, along with hysteresis data for my central heating system, to more efficiently control the temperature at home by doing things like predictive start up and shutdown of the central heating boiler.

I started by looking into building my own system from scratch, using a microcontroller and one-wire sensors. Cheap and easy for the basic temperature and pressure readings, but the difficult part was always going to be the mechanical parts for the rain gauge, wind direction and wind speed sensors. So I kept putting it off. Then I noticed Maplin were selling a complete wireless weatherstation with computer connectivity for only £50. I figured it was worth it just to get the sensors.

Of course it’s been built to a price (in China by a company called Fine Offset), and is a generic design that ends up being sold with a few variations under a variety of different brand names. Mine has sensors for temperature, humidity, rainfall, wind speed and wind direction mounted on a small mast, which then send their measurements via a 433MHz wireless connection to a console (the base station) that has further temperature, humidity and pressure sensors, a large LCD display and a USB port.

The console displays the current readings, and stores a historical log of the data that can be scrolled through at will. However, of more interest to me, once the console is connected to my home server I can grab the latest set of readings from all the sensors via a program. I can then graph that data, or upload it to online systems like WeatherUnderground or Xively, or even use it to optimise the control of my central heating system!

Of course, to get decent readings it’s necessary to mount the sensors where they can work at their best. Ideally you’d want to split the sensors up and mount them in different places; the wind sensors on a tall mast, well away from buildings, and the rain and temperature sensors low to the ground, carefully sited to get the best results.

But for me, this is meant to be a bit of fun, so I’m not going to get too hung up about any of that. In practice the best I can sensibly manage is to put them all on a tall mast. I picked up a 6 foot cranked aluminium TV aerial mast and a zinc-plated wall bracket for about £15, and modified the original mast so it could be attached to the cranked mast.

So back at the end of April I mounted the whole lot on the side of my garage, so the sensors were positioned above the roof-line of the garage. I expected the wind direction to be influenced by the garage and other nearby buildings, and for the temperature sensor to over-read on particularly sunny days when the sun is shining directly on it, but I figured it ought to be fine for my purposes.

And for the first month or so, it basically worked. Except the base station seemed to have a very tenuous connection to the sensors on the mast. I reckon it maintained contact for no more than 10-20% of the time, so most of my readings were only of the base station sensors – which were all indoor, and of limited interest. Getting reconnected usually involved standing in the garden waving the base station around for a few minutes, and often having done that, it would lose contact again as I took it back indoors. Worse, after a few weeks the wind speed indicator stopped working. It wouldn’t rotate in anything less than about a force 4 or 5 wind, and even then it wouldn’t indicate the correct wind speed. I also noticed that the wind direction sensor tended not to produce very consistent results, but rather, would “helicopter” all over the place in anything other than the most stable of gentle breezes.

This was not what I had hoped for, but I just didn’t have the spare time to fight with it in the run up to my operation in July, so everything stagnated for a time. But with my operation delayed, and me taking some vacation, I finally got around to looking into the problems with the weather station this week.

The first thing to solve was the wind speed sensor. Taking it apart reveals that there is a simple reed switch in the base of the unit. Gently levering the spinning cups off the base of the wind speed unit reveals a magnet, attached to the spinning cups, which triggers the reed switch. The spinning cups are attached to the base of the unit by way of a mini-bearing (5x10x4mm) which allows it to spin easily. Or in my case, not, as the bearing had failed, producing a noticeable sticking point in the rotation. So I ordered a pair of new bearings from Technobots for about £1.40. Repairing was a simple matter of removing the old bearing, pressing on the new one, replacing the spinning cups, and then making sure everything was carefully aligned so it would spin smoothly again.

I feared that improving the connectivity between the console and the sensors would be a lot more difficult, as I can only think of a few ways to improve a radio link:

  • Improve the siting of the transmitter: Not easy when mine is already at the top of a tall mast!
  • Improve the siting of the receiver: I can’t really do much to improve matters, as I’m constrained by the need to connect it to my home server.
  • Improve the output of the transmitter: there’s not much that can legally be done to 433MHz kit. Messing with this would be a last resort, especially as whatever I did would need to operate on 3xAA batteries, at the top of a mast, and be weatherproof!
  • Improve the sensitivity of the receiver: Really the only option for me, which actually comes down to improving the receiving antenna.

So I disassembled the console. And discovered that the internal antenna consists of a 1/4 wavelength piece of unshielded wire, wrapped around the edge of the case. Probably about as good for communications as the proverbial piece of damp string. But the good news is that the back of the case is practically empty, and there is a nice flat, horizontal surface on the top edge of the back panel. A quick trawl through the RS Components catalog reveals that they stock inexpensive helically wound stub antennas for 433MHz telemetry equipment. Add an appropriate panel mounting, a little bit of spare coaxial cable, and with a little careful soldering I now have a console with a removable external antenna.
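For a sense of scale, the length of a quarter-wave wire at this frequency can be worked out directly. This is a back-of-envelope sketch that assumes exactly 433MHz and the free-space speed of light; a real antenna is usually cut slightly shorter, and the function name is just for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # free-space value


def quarter_wave_cm(frequency_hz):
    """Return a quarter of the free-space wavelength, in centimetres."""
    wavelength_m = SPEED_OF_LIGHT_M_PER_S / frequency_hz
    return wavelength_m / 4 * 100


if __name__ == "__main__":
    # Roughly 17.3 cm at 433 MHz.
    print(round(quarter_wave_cm(433e6), 1))
```

So the wire wrapped around the inside of the case only needs to be around 17cm long, which explains why it fits so easily.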

The result? The console instantly connected to the sensors as soon as the batteries were installed in my study. No need to take it for a walk in the garden and wave it around under the sensors any more. And as far as I can tell from my Weather Underground page, the console hasn’t lost contact with the sensors since. A spectacularly good result for £5 of components and a little effort.

However, that page clearly shows the problem with the wind vane. If you look at the plot, it’s clear that the wind is generally from the SouthWest, but the data points are all over the place because (I think) the sensor lacks damping. So for now I’m researching ways of damping it that won’t spoil the accuracy. What is needed is something that resists sudden movements, but will happily respond to slow ones, even when the force is very low. Magnetic damping is probably the best option, but given that the vane uses a magnet and reed switches to detect its orientation, that might be hard to arrange. Given the cost, I’m tempted to get a spare vane to experiment on. And use as spares if (when?!) I break the original.

Sticky tape

Or, in this case, not.

In general, the flexible LED tapes come with a self-adhesive backing tape. You remove the backing from that, and stick the tape to whatever you want. It (apparently) works a treat. However, the LED tape suppliers also want their tape to be used in harsher environments like bathrooms, kitchens, or outside, where there may be more ambient water around. So they’ve taken to encapsulating their LED tape in silicone. This provides both protection from splashes and dampness, and also a degree of physical protection too, without compromising the flexibility or the light output.

My test purchase was of the latter type. And the problem with this stuff is that the silicone both adds weight, and is a devil to adhere to.

To resolve the first issue, the backing tape really needs to be a lot stronger than that on the normal tape. In my case it’s apparently branded 3M adhesive tape, but it’s clearly only just borderline strong enough to hold the LED tape in place when it’s stuck upside down without support. The LED tape and adhesive backing tape are pulling away from the cupboard in places. But worse, in other places the LED tape is pulling away from the adhesive backing tape, leaving just that stuck to the cupboard. Silicone is difficult stuff to stick to, and clearly this 3M tape is struggling.

Now, admittedly it’s hot weather at the moment – pushing 30°C in my study – but these adhesive tapes are normally rated to 100°C or more, so I don’t think that’s the root issue here. It’s the weight and the silicone encapsulation that are causing the problems. So, what options do I have?

When I come to do the kitchen I could switch to un-encapsulated LED tape. That would solve the problem. But it’s not going to be as easy to keep clean, and it’s going to be exposed to steam etc. That doesn’t seem like a good solution. So I really need a better approach to mounting the silicone encapsulated LED tape.

My first thought was “better adhesive tape”. There are structural adhesive tapes around (usually called Very High Bond tapes) that can even be used as alternatives to spot welding. They’re not cheap, but I hoped that they might do the job. And there are some mid-range very high strength “professional” double-sided adhesive tapes that are used to make things like advertising signs that might be OK too. So I called a specialist adhesive tape supplier, Tapes Direct, and asked for some technical help. I ended up talking to the owner, and he wasn’t convinced that any of the normal tapes on the market will work well with silicone – not even the VHB stuff at £50 a roll. Kudos to him for not trying to sell me something that wouldn’t work too – proper customer service – I’ll definitely be using him next time I need some specialist tape. But for now it sounds as though adhesive tape is not the answer.

So the other thought is to stick it in place with Silicone sealant. I suspect this is one of those situations where it will be worth paying for a good quality sealant from someone like Dow Corning or Unibond. But the problem with this is that the good quality silicones all cure slowly, developing maximum strength over about 3 days. Which isn’t going to work upside down on a kitchen cupboard.

So the solution is to get some cheap angle or channel, and mount the LED tape onto that, using the silicone, and then mount that onto the kitchen cabinet (with something like screws) once the silicone has cured. You might be able to get away with plastic channel, but my preference is for some aluminium angle; it’s more rigid, so will mount more easily, and isn’t expensive from a wholesaler, even when bought in small quantities.

So later this week I’ll demount the LED tape in my study and build it up into what amounts to a custom light fitting. The trial continues. But of course, this is going to add to the cost. By self-building, I’m currently looking at about £20 a meter for this LED lighting. Adding aluminium angle & quality silicone sealant is going to raise that, perhaps to nearer £30 a meter. It’s still cost-effective, but the differential to something like these, at about £65/m is falling.

On the positive side though, as well as being cheaper, mine are still both brighter and easier to dim!

Let there be (dimmable) light!

I find that there is a point at which you need to switch from doing academic research to carrying out some practical experimentation. I reached that stage on Sunday evening while investigating whether LED “strip” lighting would be a good replacement for the traditional fluorescent work surface lighting.

I realised that I didn’t have any feel for how bright any of these tapes actually were, or if their characteristics would make them good or bad sources of light for work surfaces. I’d also had suggestions from contacts on Twitter that it was a lot harder to dim these strips than I was expecting.

But it transpires that in my study I happen to have a run of wall cupboards over my desk which closely mirrors a kitchen layout. The desk is much deeper, and the cupboards are mounted higher than they would be in a kitchen, but the principle is the same. If anything, my study would be a more challenging environment because of the higher mounting point and larger area to illuminate.

So I ordered 2 meters of moderately high output “dimmable” LED strip; this is built with 60 cool-white SMD 5050 LEDs per meter of tape, operating at 12v and drawing a little under 15w a meter. It’s all encapsulated in a silicone coating, and backed with a 3M self-adhesive coating. I added a 33w “TRIAC dimmable” LED driver that someone had reviewed as working successfully for them, and a Varilight V-Pro low power dimmer switch.

The advantage of that dimmer switch is that it can run with a minimum load of only 10w, unlike normal dimmer switches that usually require a minimum load of 40w or more. It’s also a “smart” dimmer switch, where the mode (leading/trailing edge dimming) and minimum brightness point can be “programmed” into the switch.
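A quick sanity check of those numbers, as a sketch: it takes the strip’s full-brightness draw as 15w a meter (the figure above is “a little under” that), and ignores any losses in the LED driver, which is a simplification. The function and parameter names are made up for the example.

```python
def check_led_load(length_m, watts_per_m, driver_max_w, dimmer_min_w):
    """Compare the strip's full-brightness load with the driver and dimmer limits."""
    load_w = length_m * watts_per_m
    return {
        "load_w": load_w,
        "within_driver_rating": load_w <= driver_max_w,
        "above_dimmer_minimum": load_w >= dimmer_min_w,
    }


if __name__ == "__main__":
    # 2 m of strip at up to 15 W/m, a 33 W driver, and a dimmer needing at least 10 W.
    print(check_led_load(2, 15, 33, 10))
```

At roughly 30w, the 2 meter strip sits under the 33w driver rating and comfortably above the dimmer’s 10w minimum load. With a conventional dimmer needing 40w or more, the same strip would be below the minimum.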

I’ve just set it all up “loose” on my desk, and it works quite well. The LED strip is very bright; more than sufficient to light a work surface under a kitchen cabinet. In fact, for my immediate “test” application in my study, it would be too bright without the dimmer.

The dimmer works very well in trailing edge mode. It’s completely silent at minimum and maximum brightness points, with only a very slight buzz from the LED driver at the mid-point. Minimum brightness is (subjectively) about 25% of the maximum, and control of the light level within those extremes is very smooth. Perhaps the only issue is that turning the LEDs on with the dimmer at anything other than full brightness seems to take a second or two for the dimmer to fire everything up. Noticeable, but not necessarily a problem.

Out of interest, I also tried the dimmer in leading edge mode; it wasn’t a good experience. The LEDs did not dim very much, and the LED driver produced a much more noticeable buzzing noise. Trailing edge is definitely the way to go, at least for this set up.

So, the summary is that the LED strip tape is fine for my intended use. I will almost certainly need to be able to dim it, and I now know that is possible without resorting to expensive professional remote-controlled low-voltage dimming, even though it’s perhaps not as ultimately good. From what I can see, the problems with this type of solution (noise, failure to start the LEDs) are probably all related to the LED drivers, so finding ones that are known to work well is going to be the key to success.

Circuit to dim 12v flexible LED strip

Update: Having had this system properly wired into my study for the last few days, a small issue has arisen; the adhesive backing tape (which I think is probably double-sided adhesive tape that is pre-applied as part of the manufacturing process) is not proving man enough for the job. About 2 days after I initially applied the strip to the bottom of the overhead cupboards, it started to peel off. Pressing it back into place makes it stick again for a while, but it’s definitely not a good long-term solution.

At the moment I’m looking for a better fixing system, but the fact that the tape is encapsulated in silicone is not helpful, as getting anything to adhere to it is problematic. My first thoughts are around using a silicone-based adhesive/sealant to stick it up … but it will need supporting in place while everything cures, which could be interesting. More research required.