Jeff Epler's blog

1 March 2016, 2:56 UTC

Literal copying of GPL code into ZFS on Linux


Recently, Canonical announced that they plan to ship zfsonlinux in compiled form from their "main" repository. They rely on advice from their lawyers that the compiled form of zfs.ko, the kernel loadable module implementing zfs, is not a derivative of the linux kernel. This is important because, if zfsonlinux is a derivative of both GPL code and CDDL code, the two licenses impose incompatible requirements on the compiled form of the software, and distribution is not possible. (This is in contrast to Debian's plan to ship zfs in source form only in their next release, and in the somewhat ghettoized "contrib" section rather than "main"; many consider this safer because the GPL's restrictions on code in source form are less stringent than its restrictions on code in object code or executable form.)

There are two main ways that zfsonlinux's zfs.ko could be a derivative work of linux. The first, and the one that many commenters have concentrated on, is that the act of compiling zfs also includes copyrighted portions of linux in a way that triggers the GPL's restrictions, because the combined work is "a work based on [the Linux kernel]". However, there is a second way that zfsonlinux could be subject to the GPL and CDDL simultaneously: if source code licensed exclusively under the GPL has been copied into the zfsonlinux source tree.

In this article, I collect some samples of code from zfsonlinux and compare those samples to code committed earlier to linux. I present the samples using a customized word diff algorithm. To reproduce my results, you will need to manually find the associated refs in linux.git and zfsonlinux/zfs.git and use your own word-diff algorithm.
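
If you want a starting point for reproducing this, here is a rough sketch of one approach (not the tool I actually used): pull the two blobs out of the repositories with git, then look for long runs of identical word tokens with Python's difflib. The repository paths and file names below are placeholder guesses about where the relevant code lives; adjust them to the files you are actually comparing.

# Rough sketch of a word-level comparison between two git blobs.
# Not my actual tool; repository paths and file names are placeholders.
import difflib
import re
import subprocess

def words(repo, ref_and_path):
    """Fetch a blob from a git repository and split it into word tokens."""
    text = subprocess.check_output(
        ["git", "-C", repo, "show", ref_and_path], text=True)
    return re.findall(r"\S+", text)

a = words("linux", "5f99f4e79abc:include/linux/fs.h")        # snippet A
b = words("zfs", "0f37d0c8bed4:include/linux/vfs_compat.h")  # snippet B (path is a guess)

# Long runs of identical tokens are the interesting part here: they are the
# candidate copied regions.
for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if op == "equal" and i2 - i1 >= 20:
        print("%d identical words:" % (i2 - i1), " ".join(a[i1:i2]))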

I don't know whether I've found the worst of the code copying; I found the two examples given below pretty quickly, imagined that I had found just the tip of the iceberg, and then decided to write this article before looking for more. But I haven't found any more that are as clear-cut, to my mind, as these two examples while also being over ten lines long. Beginner's luck? Or did I find the 50 copied lines out of 260,000 in a half hour and that's the whole story?

Because I cannot speak for the authors of the code committed to linux.git that appears to be duplicated in zfsonlinux, it is impossible for me to say whether the zfsonlinux developers have some private arrangement with those authors that permits the use of the code snippets under the CDDL. If not, it's a legal question whether the amount of code copied is relevant under copyright law. In this area, I imagine caution is best; after all, in the US, Google's copying of the 9-line "rangeCheck" function (out of how many million lines in the JRE?) has been determined not to be de minimis, though in a new trial Google will assert a fair use defense regarding that copying.

(note: some popular RSS readers do not show the color coding of insertions and deletions; in the plain-text rendering below, text present only in snippet A is marked [-like this-], text present only in snippet B is marked {+like this+}, and common text is unmarked)


Snippet A: linux at commit 5f99f4e79abc64ed9d93a4b0158b21c64ff7f478
Snippet B: zfsonlinux at commit 0f37d0c8bed442dd0d2c1b1dddd68653fa6eec66
Common text

static inline bool dir_emit(struct dir_context *ctx,
                           const char *name, int namelen,
                           [-u64-]{+uint64_t+} ino, unsigned type)
{
       return ctx->actor(ctx->dirent, name, namelen, ctx->pos, ino, type) == 0;
}
static inline bool dir_emit_dot(struct file *file, struct dir_context *ctx)
{
       return ctx->actor(ctx->dirent, ".", 1, ctx->pos,
                         file->f_path.dentry->d_inode->i_ino, DT_DIR) == 0;
}
static inline bool dir_emit_dotdot(struct file *file, struct dir_context *ctx)
{
       return ctx->actor(ctx->dirent, "..", 2, ctx->pos,
                         parent_ino(file->f_path.dentry), DT_DIR) == 0;
}
static inline bool dir_emit_dots(struct file *file, struct dir_context *ctx)
{
       if (ctx->pos == 0) {
               if (!dir_emit_dot(file, ctx))
                       return false;
               ctx->pos = 1;
       }
       if (ctx->pos == 1) {
               if (!dir_emit_dotdot(file, ctx))
                       return false;
               ctx->pos = 2;
       }
       return true;
}


Snippet A: linux at c3c532061e46156e8aab1268f38d66cfb63aeb2d
Snippet B: zfsonlinux at 5547c2f1bf49802835fd6c52f15115ba344a2a8b
Common text

static inline int bdi_setup_and_register(struct backing_dev_info *bdi, char *name,
                           unsigned int cap)
{
        char tmp[32];
        int [-err-]{+error+};

        bdi->name = name;
        bdi->capabilities = cap;
        [-err-]{+error+} = bdi_init(bdi);
        if ([-err-]{+error+})
                return [-err;-]{+(error);+}

        sprintf(tmp, "%.28s%s", name, "-%d");
        [-err-]{+error+} = bdi_register(bdi, NULL, tmp, atomic_long_inc_return(&[-bdi_seq-]{+zfs_bdi_seq+}));
        if ([-err-]{+error+}) {
                bdi_destroy(bdi);
                return [-err;-]{+(error);+}
        }

        return [-0;-]{+(error);+}
}

[permalink]

13 December 2013, 23:26 UTC

Benchmarking ungeli on real data


Since the goal of this little project is to actually read my geli-encrypted zfs filesystems on a Linux laptop, I had to get a USB enclosure that supports drives bigger than 2TB; I also got a model which supports USB 3.0. The news I have is:

Ungeli and zfs-on-linux work for this task. I was able to read files and verify that their content was the same as on the Debian kFreeBSD system.

The raw disk I tested with (WDC WD30 EZRX-00DC0B0) gets ~155MiB/s at the start, ~120MiB/s in the middle, and ~75MiB/s at the end of the first partition according to zcav. Even though ungeli has had no serious attempt at optimization, it achieves over 90% of this read rate when zcav is run on /dev/nbd0 instead of /dev/sdb1, up to 150MiB/s at the start of the device, while consuming about 50% CPU.

(My CPU does have AES instructions, but I don't know for certain whether my OpenSSL uses them the way I've written the software. I do use the envelope (EVP) functions, so I expect that it does. "openssl speed -evp aes-128-xts" gets over 2GB/s on 1024- and 8192-byte blocks.)
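
To give a concrete idea of the work being benchmarked, here is a minimal sketch of per-sector AES-XTS decryption in Python. This is not ungeli's actual code (ungeli goes through OpenSSL's EVP interface), and the key, sector size, and tweak derivation shown are placeholder assumptions, not GELI's real on-disk scheme.

# Sketch of per-sector AES-XTS decryption, roughly the kind of work ungeli
# performs for each block served over nbd.  NOT ungeli's real code: the key,
# sector size, and tweak derivation are placeholder assumptions.
import struct
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

SECTOR = 4096                # assumed sector size
KEY = bytes(range(32))       # dummy 256-bit combined key for AES-128-XTS

def decrypt_sector(ciphertext, sector_number):
    # Use the sector number as the 128-bit XTS tweak (an assumption; GELI
    # derives its per-sector IV/tweak from its own metadata).
    tweak = struct.pack("<Q", sector_number) + bytes(8)
    decryptor = Cipher(algorithms.AES(KEY), modes.XTS(tweak)).decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()

assert len(decrypt_sector(bytes(SECTOR), 0)) == SECTOR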

Unfortunately, zfs read speeds are not that hot. Running md5sum on 2500 files totalling 2GB proceeded at an average of less than 35MB/s. I don't have a figure for this specific device when it's attached to (k)FreeBSD via SATA, but I did note that the same disk scrubs at 90MB/s. On the other hand, doing a similar test on my kFreeBSD machine (but on the raidz pool I have handy, not a volume made from a single disk), I also get about 35MB/s from md5sum, so maybe this is simply what zfs performance is.

All in all, I'm simply happy to know that I can now read my backups on either Linux or (k)FreeBSD.

[permalink]

25 November 2013, 2:31 UTC

Encrypted ZFS for off-site backups


As I recently discussed, I use zfs replication for my off-site backups, manually moving volumes from my home to a second location on a semi-regular schedule.

Of course, if one of these drives were stolen or lost, I would rather the thief not have a copy of all my data. Therefore, I use geli to encrypt the entire zpool.

read more…

9 October 2013, 13:11 UTC

My ZFS replication script


On my new Debian GNU/kFreeBSD system, my backup strategy has changed. On the previous system, I relied on incremental dumps and a DAT160 tape drive, which has an 80GB uncompressed capacity. When you have a few hundred gigabytes of photos to back up, this is an inconvenient solution. On the new system, I am using multiple removable 3TB hard drives and a set of scripts built around zfs send/receive.

Cron runs the rep.py script 4 times a day; for each filesystem to be backed up, it does a zfs send | zfs receive pipeline. On a semi-regular basis (I haven't yet decided what schedule to do this on; with tape backups I did it less than once a month, even though it was comparatively easier), I remove the drive to an off-site location, return the other drive from off-site, and insert it. (There's also fiddling with zpool import/export, of course.)
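
The core of that pipeline is just two processes wired together. Here is a minimal sketch of an incremental send/receive (not the actual rep.py; the snapshot and pool names are made up):

# Minimal sketch of an incremental zfs send | zfs receive pipeline, roughly
# what rep.py drives for each filesystem.  Names are illustrative only.
import subprocess

def replicate(old_snap, new_snap, dest_fs):
    """Send the increment between two snapshots into dest_fs."""
    send = subprocess.Popen(
        ["zfs", "send", "-i", old_snap, new_snap],
        stdout=subprocess.PIPE)
    recv = subprocess.Popen(
        ["zfs", "receive", "-F", dest_fs],
        stdin=send.stdout)
    send.stdout.close()      # so 'zfs send' sees SIGPIPE if the receive dies
    recv_rc = recv.wait()
    send_rc = send.wait()
    if recv_rc or send_rc:
        raise RuntimeError("replication of %s failed" % new_snap)

# e.g. replicate("mpool/home@rep-0001", "mpool/home@rep-0002", "bpool/home")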

The rep.py script relies on the zfs python module, also of my own creation. This module has facilities for inspecting and interacting with zfs filesystems, e.g., to list filesystems and snapshots, to create and destroy snapshots, and to run replication pipelines.

The rep.py script needs to be customized for your system. Customization items are:

TARGETS = ['bpool', 'cpool']
SRCS = ['mpool', 'rpool']

"SRCS" is a list of zpools which are replicated. "TARGETS" is a list of zpools to which backups are replicated. The first available pool out of TARGETS is chosen. (so if more than one TARGET is inserted, only one will ever be used)

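A sketch of that target selection (not necessarily how zfs.py actually does it) amounts to asking zpool which pools are imported and taking the first match:

# Pick the first backup pool that is currently imported -- a sketch; the
# real rep.py/zfs.py logic may differ.
import subprocess

TARGETS = ['bpool', 'cpool']

def choose_target():
    imported = set(subprocess.check_output(
        ["zpool", "list", "-H", "-o", "name"], text=True).split())
    for name in TARGETS:
        if name in imported:
            return name
    raise SystemExit("no backup pool is available")
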
You can designate individual filesystems as not replicated by setting the user property net.unpy.zreplicator:skip to the exact string "1", i.e.,

zfs set net.unpy.zreplicator:skip=1 examplepool/junkfiles
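
A replication script can honor that property with one zfs get per filesystem; a sketch of such a check (an assumption about the implementation, not a quote from rep.py) looks like:

# Skip filesystems tagged net.unpy.zreplicator:skip=1 -- a sketch of the
# check, not the exact code in rep.py/zfs.py.
import subprocess

def is_skipped(filesystem):
    value = subprocess.check_output(
        ["zfs", "get", "-H", "-o", "value",
         "net.unpy.zreplicator:skip", filesystem],
        text=True).strip()
    return value == "1"

# e.g. filesystems = [fs for fs in all_filesystems if not is_skipped(fs)]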

Files currently attached to this page:

rep.py (1.5kB)
zfs.py (12.7kB)

License: GPLv2+

[permalink]

All older entries
Website Copyright © 2004-2024 Jeff Epler