From: Eric Sandeen <sandeen@redhat.com>
Date: Thu, 11 Sep 2008 12:08:47 -0500
Subject: [fs] implement fallocate syscall
Message-id: 48C9509F.9060406@redhat.com
O-Subject: [PATCH RHEL5.3 UPDATED 2] fs: implement fallocate syscall
Bugzilla: 450566
RH-Acked-by: Josef Bacik <jbacik@redhat.com>
RH-Acked-by: Rik van Riel <riel@redhat.com>
RH-Acked-by: Peter Staubach <staubach@redhat.com>

This is for Bug 450566 - FEAT: RHEL5.3 backport fallocate syscall

This implements the new syscall sys_fallocate, which can preallocate
space in a filesystem without having to pre-write all blocks, similar
to what XFS has had as an ioctl for a while.  ext4 is also able to make
use of this, which is what motivates this patch.

It's been built through brew
(http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1425458)
and tested on x86, x86_64, ppc, ia64, and s390.  s390 failed, but due
to more generic ext4 issues on that arch that I'm still working out.

The patch below includes several commits, first adding BH_Unwritten to
the buffer head flags set; this indicates an allocated but not
initialized block on disk.  After that comes the syscall itself and the
various architecture wire-ups, only for the architectures we ship in
RHEL.  For new upstream syscall slots in between, it adds NI_SYSCALL
etc.

This patch diverges just a bit from the upstream commits for KABI
reasons.

First of all, adding a new buffer head flag - we cannot move
BH_PrivateStart, so I added BH_Unwritten at the top bit (31), assuming
that any 3rd party FS will start at BH_PrivateStart and move up; no
in-tree filesystem, at least, uses so many flags that it will get
anywhere near slot 31.

Also, we have to add *fallocate to struct inode_operations, so that is
wrapped in the __GENKSYMS__ trick.  Then, to know whether it's safe to
test for the presence of ->fallocate, I've added a new fs flag
FS_HAS_FALLOCATE so that the vfs will know it's safe to test past the
"normal" size of inode_operations that may be present in 3rd party fs
modules.

*** UPDATED *** moved FS_HAS_FALLOCATE down per Rik's suggestion.

*** UPDATED 2 *** Address Mikulas' point about signed overflow - not
yet upstream but I super-swear I'll get it there.  :)

I've also included a HAVE_FALLOCATE define in fs.h so that out-of-tree
filesystems can test whether this kernel supports the fallocate syscall
at build time.

Thanks,
-Eric

=====================

From: David Chinner <dgc@sgi.com>
Date: Mon, 12 Feb 2007 08:51:41 +0000 (-0800)
Subject: [PATCH] Make BH_Unwritten a first class bufferhead flag V2
X-Git-Tag: v2.6.21-rc1~912~215
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=33a266dda9fbbe72dd978a451a8ee33c59da5e9c

[PATCH] Make BH_Unwritten a first class bufferhead flag V2

Currently, XFS uses BH_PrivateStart for flagging unwritten extent state
in a bufferhead.  Recently, I found the long standing mmap/unwritten
extent conversion bug, and it had to do with partial page invalidation
not clearing the unwritten flag from bufferheads attached to the page
but beyond EOF.  See here for a full explanation:

http://oss.sgi.com/archives/xfs/2006-12/msg00196.html

The solution I have checked into the XFS dev tree involves duplicating
code from block_invalidatepage to clear the unwritten flag from the
bufferhead(s), and then calling block_invalidatepage() to do the rest.

Christoph suggested that this would be better solved by pushing the
unwritten flag into the common buffer head flags and just adding the
call to discard_buffer():

http://oss.sgi.com/archives/xfs/2006-12/msg00239.html

The following patch makes BH_Unwritten a first class citizen.

Signed-off-by: Dave Chinner <dgc@sgi.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 659aeee..583d25d 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -326,3 +326,9 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_ni_syscall		/* sys_epoll_pwait */
+	.long sys_ni_syscall		/* 320 */ /* sys_utimensat */
+	.long sys_ni_syscall		/* sys_signalfd */
+	.long sys_ni_syscall		/* sys_timerfd_create */
+	.long sys_ni_syscall		/* sys_eventfd */
+	.long sys_fallocate
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index b3dc97f..a4da084 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1598,7 +1598,7 @@ sys_call_table:
 	data8 sys_sync_file_range		// 1300
 	data8 sys_tee
 	data8 sys_vmsplice
-	data8 sys_ni_syscall			// reserved for move_pages
+	data8 sys_fallocate
 	data8 sys_getcpu
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index 4b3b2af..537d936 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -838,6 +838,13 @@ asmlinkage int compat_sys_truncate64(const char __user * path, u32 reg4,
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index b3b3389..236f1ef 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1658,3 +1658,13 @@ compat_sys_vmsplice_wrapper:
 	llgfr	%r4,%r4			# unsigned int
 	llgfr	%r5,%r5			# unsigned int
 	jg	compat_sys_vmsplice
+
+	.globl	sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg	%r4,%r4,32		# get high word of 64bit loff_t
+	lr	%r4,%r5			# get low word of 64bit loff_t
+	sllg	%r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
diff --git a/arch/s390/kernel/sys_s390.c b/arch/s390/kernel/sys_s390.c
index e351780..6b39d17 100644
--- a/arch/s390/kernel/sys_s390.c
+++ b/arch/s390/kernel/sys_s390.c
@@ -266,3 +266,22 @@ s390_fadvise64_64(struct fadvise64_64_args __user *args)
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
 
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	return sys_fallocate(fd, mode, offset, ((u64)len_high << 32) | len_low);
+}
+#endif
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 93be1d5..4f7bc36 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -318,3 +318,9 @@ SYSCALL(sys_splice,sys_splice,sys_splice_wrapper)
 SYSCALL(sys_sync_file_range,sys_sync_file_range,sys_sync_file_range_wrapper)
 SYSCALL(sys_tee,sys_tee,sys_tee_wrapper)
 SYSCALL(sys_vmsplice,sys_vmsplice,compat_sys_vmsplice_wrapper)
+NI_SYSCALL						/* 310 sys_move_pages */
+NI_SYSCALL						/* 311 sys_getcpu */
+NI_SYSCALL						/* 312 sys_epoll_pwait */
+NI_SYSCALL						/* 313 sys_utimes */
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
+
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 02a050a..7a4b2bc 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -730,8 +730,14 @@ ia32_sys_call_table:
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad quiet_ni_syscall		/* sys_epoll_pwait */
+	.quad quiet_ni_syscall		/* 320 */ /* compat_sys_utimensat */
+	.quad quiet_ni_syscall		/* compat_sys_signalfd */
+	.quad quiet_ni_syscall		/* sys_timerfd_create */
+	.quad quiet_ni_syscall		/* sys_eventd */
+	.quad sys32_fallocate
 ia32_syscall_end:
diff --git a/arch/x86_64/ia32/sys_ia32.c b/arch/x86_64/ia32/sys_ia32.c
index a07fe80..8d967d0 100644
--- a/arch/x86_64/ia32/sys_ia32.c
+++ b/arch/x86_64/ia32/sys_ia32.c
@@ -915,3 +915,10 @@ long sys32_lookup_dcookie(u32 addr_low, u32 addr_high,
 	return sys_lookup_dcookie(((u64)addr_high << 32) | addr_low, buf, len);
 }
 
+asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
+				unsigned offset_hi, unsigned len_lo,
+				unsigned len_hi)
+{
+	return sys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+			     ((u64)len_hi << 32) | len_lo);
+}
diff --git a/fs/buffer.c b/fs/buffer.c
index 9b33385..091319e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1582,6 +1582,7 @@ static void discard_buffer(struct buffer_head * bh)
 	clear_buffer_req(bh);
 	clear_buffer_new(bh);
 	clear_buffer_delay(bh);
+	clear_buffer_unwritten(bh);
 	unlock_buffer(bh);
 }
 
@@ -2001,6 +2002,7 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
 			continue;
 		}
 		if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
+		    !buffer_unwritten(bh) &&
 		     (block_start < from || block_end > to)) {
 			ll_rw_block(READ, 1, &bh);
 			*wait_bh++=bh;
@@ -2720,7 +2722,7 @@ int block_truncate_page(struct address_space *mapping,
 	if (PageUptodate(page))
 		set_buffer_uptodate(bh);
 
-	if (!buffer_uptodate(bh) && !buffer_delay(bh)) {
+	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
 		err = -EIO;
 		ll_rw_block(READ, 1, &bh);
 		wait_on_buffer(bh);
diff --git a/fs/open.c b/fs/open.c
index 793eb22..80435ef 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -28,6 +28,7 @@
 #include <linux/syscalls.h>
 #include <linux/rcupdate.h>
 #include <linux/audit.h>
+#include <linux/falloc.h>
 
 #include <asm/unistd.h>
 
@@ -353,6 +354,74 @@ asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length)
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+	/*
+	 * Revalidate the write permissions, in case security policy has
+	 * changed since the files were opened.
+	 */
+	ret = security_file_permission(file, MAY_WRITE);
+	if (ret)
+		goto out_fput;
+
+	inode = file->f_dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if ((loff_t)((unsigned long long)offset + (unsigned long long)len) < 0 ||
+	    (offset + len) > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	/*
+	 * KABI trick; filesystem implementing ->fallocate must
+	 * set FS_HAS_FALLOCATE in fs_flags so we know it's safe to test
+	 */
+	if (!(inode->i_sb->s_type->fs_flags & FS_HAS_FALLOCATE)) {
+		ret = -EOPNOTSUPP;
+		goto out_fput;
+	}
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -EOPNOTSUPP;
+
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
 #ifdef __ARCH_WANT_SYS_UTIME
 
 /*
diff --git a/fs/xfs/linux-2.6/xfs_linux.h b/fs/xfs/linux-2.6/xfs_linux.h
index a13f75c..484dba1 100644
--- a/fs/xfs/linux-2.6/xfs_linux.h
+++ b/fs/xfs/linux-2.6/xfs_linux.h
@@ -109,16 +109,6 @@
 #undef  HAVE_PERCPU_SB	/* per cpu superblock counters are a 2.6 feature */
 #endif
 
-/*
- * State flag for unwritten extent buffers.
- *
- * We need to be able to distinguish between these and delayed
- * allocate buffers within XFS.  The generic IO path code does
- * not need to distinguish - we use the BH_Delay flag for both
- * delalloc and these ondisk-uninitialised buffers.
- */
-BUFFER_FNS(PrivateStart, unwritten);
-
 #define restricted_chown	xfs_params.restrict_chown.val
 #define irix_sgid_inherit	xfs_params.sgid_inherit.val
 #define irix_symlink_mode	xfs_params.symlink_mode.val
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index ab5b845..73b0e90 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,15 @@
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
-
+/* #define __NR_epoll_pwait	319 */
+/* #define __NR_utimensat	320 */
+/* #define __NR_signalfd	321 */
+/* #define __NR_timerfd_create	322 */
+/* #define __NR_eventfd		323 */
+#define __NR_fallocate		324
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 325
 
 #ifndef __KERNEL_SYSCALLS_NO_ERRNO__
 /*
diff --git a/include/asm-ia64/unistd.h b/include/asm-ia64/unistd.h
index e925cf6..47f6932 100644
--- a/include/asm-ia64/unistd.h
+++ b/include/asm-ia64/unistd.h
@@ -291,7 +291,7 @@
 #define __NR_sync_file_range		1300
 #define __NR_tee			1301
 #define __NR_vmsplice			1302
-/* 1303 reserved for move_pages */
+#define __NR_fallocate			1303
 #define __NR_getcpu			1304
 
 #ifdef __KERNEL__
diff --git a/include/asm-powerpc/systbl.h b/include/asm-powerpc/systbl.h
index e4f8cdc..4d56ff6 100644
--- a/include/asm-powerpc/systbl.h
+++ b/include/asm-powerpc/systbl.h
@@ -304,3 +304,12 @@ SYSCALL_SPU(fchmodat)
 SYSCALL_SPU(faccessat)
 COMPAT_SYS_SPU(get_robust_list)
 COMPAT_SYS_SPU(set_robust_list)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+COMPAT_SYS(fallocate)
diff --git a/include/asm-powerpc/unistd.h b/include/asm-powerpc/unistd.h
index eb66eae..f838c7b 100644
--- a/include/asm-powerpc/unistd.h
+++ b/include/asm-powerpc/unistd.h
@@ -323,10 +323,19 @@
 #define __NR_faccessat		298
 #define __NR_get_robust_list	299
 #define __NR_set_robust_list	300
+/* #define __NR_move_pages	301 */
+/* #define __NR_getcpu		302 */
+/* #define __NR_epoll_pwait	303 */
+/* #define __NR_utimensat	304 */
+/* #define __NR_signalfd	305 */
+/* #define __NR_timerfd_create	306 */
+/* #define __NR_eventfd		307 */
+/* #define __NR_sync_file_range2 308 */
+#define __NR_fallocate		309
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		301
+#define __NR_syscalls		310
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/include/asm-s390/unistd.h b/include/asm-s390/unistd.h
index aa7a243..c39b90e 100644
--- a/include/asm-s390/unistd.h
+++ b/include/asm-s390/unistd.h
@@ -302,8 +302,13 @@
 #define __NR_sync_file_range	307
 #define __NR_tee		308
 #define __NR_vmsplice		309
+/* Number 310 is reserved for sys_move_pages */
+/* Number 311 is reserved for sys_getcpu */
+/* Number 312 is reserved for sys_epoll_pwait */
+/* Number 313 is reserved for sys_utimes */
+#define __NR_fallocate		314
 
-#define NR_syscalls 310
+#define NR_syscalls 315
 
 /*
  * There are some system calls that are not present on 64 bit, some
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 9523e40..7d17bba 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -627,10 +627,22 @@ __SYSCALL(__NR_sync_file_range, sys_sync_file_range)
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_utimensat		280
+__SYSCALL(__NR_utimensat, sys_ni_syscall)
+#define __NR_epoll_pwait	281
+__SYSCALL(__NR_epoll_pwait, sys_ni_syscall)
+#define __NR_signalfd		282
+__SYSCALL(__NR_signalfd, sys_ni_syscall)
+#define __NR_timerfd_create	283
+__SYSCALL(__NR_timerfd_create, sys_ni_syscall)
+#define __NR_eventfd		284
+__SYSCALL(__NR_eventfd, sys_ni_syscall)
+#define __NR_fallocate		285
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 05d7251..e910a49 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -59,6 +59,7 @@ header-y += elf-fdpic.h
 header-y += elf.h
 header-y += elf-em.h
 header-y += fadvise.h
+header-y += falloc.h
 header-y += fd.h
 header-y += fdreg.h
 header-y += ftape-header-segment.h
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 41c7fed..d10a271 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -32,10 +32,11 @@ enum bh_state_bits {
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
-
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
 			 */
+	BH_Unwritten=31,/* Buffer is allocated on disk but not written */
+
 };
 
 #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
@@ -122,6 +123,7 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
+BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
new file mode 100644
index 0000000..8e912ab
--- /dev/null
+++ b/include/linux/falloc.h
@@ -0,0 +1,6 @@
+#ifndef _FALLOC_H_
+#define _FALLOC_H_
+
+#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
+
+#endif /* _FALLOC_H_ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f6b7e12..dafa46f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -91,6 +91,8 @@ extern int dir_notify_enable;
 /* public flags for file_system_type */
 #define FS_REQUIRES_DEV 1
 #define FS_BINARY_MOUNTDATA 2
+#define HAVE_FALLOCATE
+#define FS_HAS_FALLOCATE 4
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -1160,6 +1162,10 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+#ifndef __GENKSYMS__
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
+#endif
 };
 
 struct seq_file;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d009156..b258284 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -599,5 +599,5 @@ asmlinkage long sys_get_robust_list(int pid,
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *cache);
-
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 #endif
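
Reviewer note (illustration only, not part of the patch): the offset/len
validation in sys_fallocate() and the hi/lo argument reassembly in the
compat wrappers can be tried out in user space with a sketch like the one
below. The function names check_fallocate_range(), join_hi_lo() and
fallocate_sketch_selfcheck() are hypothetical, and max_bytes stands in for
inode->i_sb->s_maxbytes, which is assumed to be non-negative here.

```c
/* Illustrative user-space sketch; none of these names exist in the patch. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Mirrors the range checks in sys_fallocate(): reject a negative offset
 * or a non-positive length, then reject a range whose end wraps past the
 * signed 64-bit limit or exceeds max_bytes (a stand-in for s_maxbytes).
 */
int check_fallocate_range(int64_t offset, int64_t len, int64_t max_bytes)
{
	uint64_t end;

	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* Add as unsigned so the sum itself cannot overflow a signed type. */
	end = (uint64_t)offset + (uint64_t)len;
	if (end > (uint64_t)INT64_MAX || end > (uint64_t)max_bytes)
		return -EFBIG;

	return 0;
}

/* Rebuilds a 64-bit value from the two 32-bit halves a compat caller passes. */
uint64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Small self-check that exercises both helpers. */
void fallocate_sketch_selfcheck(void)
{
	assert(check_fallocate_range(0, 4096, 1 << 20) == 0);
	assert(check_fallocate_range(-1, 4096, 1 << 20) == -EINVAL);
	assert(check_fallocate_range(0, 0, 1 << 20) == -EINVAL);
	assert(check_fallocate_range(INT64_MAX, 1, INT64_MAX) == -EFBIG);
	assert(join_hi_lo(1, 0) == 4294967296ULL);
}
```

The sketch compares the unsigned sum against INT64_MAX rather than casting
it back to a signed type as the kernel code does, since that cast is
implementation-defined in portable C; the intent (catch wrap through zero)
is the same.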