
Daniel is a system programmer & web developer with 10 years of experience...
coder . cl
a journey in software development...
Re: vmsplice() + SPLICE_F_GIFT flag ??
by Daniel Molina Wegener on 27-08-2009
In his post "vmsplice() + SPLICE_F_GIFT flag ??", Eduardo Silva asks why he gets an EINVAL errno value when calling the vmsplice(2) system call on Linux systems.
does anybody know why the linux syscall vmsplice() doesn’t work with the SPLICE_F_GIFT flag ?, per manual page it should be supported but i get the following error:
My answer is as follows (it seems that Eduardo's blog is cutting the paragraphs). It is quite simple, and I am reasoning as a C programmer from the manual page, without looking at the kernel code.
We have the following declaration:
    long vmsplice(int fd, const struct iovec *iov, unsigned long nr_segs, unsigned int flags);
According to the manual page, an EINVAL or "Invalid Argument" error is generated in two cases:
1. The number of chunks supplied to the kernel, nr_segs, is greater than IOV_MAX. If IOV_MAX < nr_segs, EINVAL is returned. Check the values of IOV_MAX and nr_segs.
2. The iov_base addresses in the iovec array are not aligned. Suppose we have iovec[4] as:
    iovec[0].iov_base equal to 0x080ed000;
    iovec[1].iov_base equal to 0x080ee024; (aligned in 1024 bytes)
    iovec[2].iov_base equal to 0x080ef048; (aligned in 1024 bytes)
    iovec[3].iov_base equal to 0x080f0090; (aligned in 1048 bytes)*
* The alignment of 1048 bytes differs from the base alignment of 1024 bytes, and EINVAL is returned. You must ensure that the GIFT blocks are aligned.
The complete explanation can be written in a few lines. The vmsplice(2) system call receives a file descriptor fd, an array of iovec structures iov, the number of iovec structures in the array nr_segs, and the flags passed to the system call. The subject here is EINVAL, also known as the "Invalid Argument" errno value. In the first case, nr_segs, the number of segments, cannot be greater than IOV_MAX. If it is, the EINVAL errno value is raised: we cannot gift more segments than IOV_MAX permits. The IOV_MAX macro is defined in the <limits.h> header.
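The first case can be checked before making the call. The following is only a minimal sketch: the helper names iov_limit and nr_segs_is_valid are hypothetical, and it assumes that asking sysconf(3) for _SC_IOV_MAX at run time, with the compile-time IOV_MAX as a fallback, is an acceptable way to learn the limit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: the largest nr_segs the system accepts. */
static long iov_limit(void)
{
    long limit = sysconf(_SC_IOV_MAX);   /* run-time value, if available */

    if (limit <= 0) {
#ifdef IOV_MAX
        limit = IOV_MAX;                 /* compile-time value from <limits.h> */
#else
        limit = 16;                      /* POSIX minimum (_XOPEN_IOV_MAX) */
#endif
    }
    assert(limit > 0);
    return limit;
}

/* Hypothetical helper: false means vmsplice(2) is expected to fail with EINVAL. */
static bool nr_segs_is_valid(unsigned long nr_segs)
{
    return nr_segs <= (unsigned long)iov_limit();
}
```

A caller with more segments than the limit would have to split the iovec array over several vmsplice(2) calls.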
The second issue with the SPLICE_F_GIFT flag is that the memory chunks we gift to the kernel must be aligned, contiguous blocks of the same size. In the example we have three contiguous blocks of 1024 bytes, but the last block, the fourth, is aligned at byte 1048, leaving a hole of 24 bytes between the last two blocks. This causes the EINVAL errno value to be raised. Another interesting case is blocks of different sizes: they must also be aligned. Otherwise, a single byte of difference in the memory alignment causes the EINVAL errno value to be raised.
The EBADF errno value is also interesting. The file descriptor fd must refer to a pipe, so you must create one, for example through the pipe(2) system call or the popen(3) C library routine (whose stream wraps a pipe descriptor that fileno(3) returns). A descriptor that does not refer to a pipe, such as a socket from socketpair(2), makes the call fail with EBADF. Finally, on success, the system call returns the number of bytes transferred.
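To make the pipe requirement and the return value concrete, here is a hedged sketch rather than a reference implementation. The function name gift_one_page is hypothetical, and it assumes a glibc system where defining _GNU_SOURCE exposes vmsplice() and SPLICE_F_GIFT through <fcntl.h>. The page comes from mmap(2), so it is page aligned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gift one page to a fresh pipe.
 * Returns the number of bytes transferred, or -1 with errno set. */
static ssize_t gift_one_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fds[2];
    ssize_t n = -1;
    int saved_errno;

    if (page <= 0 || pipe(fds) == -1)
        return -1;

    /* Anonymous mappings start on a page boundary. */
    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf != MAP_FAILED) {
        memset(buf, 'x', (size_t)page);
        struct iovec iov = { .iov_base = buf, .iov_len = (size_t)page };
        /* fds[1] is the write end of the pipe; a non-pipe fd gives EBADF. */
        n = vmsplice(fds[1], &iov, 1, SPLICE_F_GIFT);
        assert(n == -1 || n <= page);    /* never more than was offered */
    }
    saved_errno = errno;

    /* Close the pipe first so nothing refers to the page any more. */
    close(fds[0]);
    close(fds[1]);
    if (buf != MAP_FAILED)
        munmap(buf, (size_t)page);

    errno = saved_errno;
    return n;
}
```

On a system where the call is permitted, the return value should equal the page size, since a fresh pipe has room for a single page.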
Now I am thinking about threaded applications. What happens to the memory blocks pointed to by the iovec array when those bytes are kept private to a thread through the pthread_key_create(3) library routine on Linux systems? This may be an interesting subject to study. I think I can discover interesting things about the implementation of keys in POSIX threads on Linux systems, mainly because the Linux kernel does not have a separate implementation for threads. For the Linux kernel, everything is a process, whether it is a user process or a thread of a process. I think this issue, together with the O(1) Linux scheduler, will be the subject of my next chapter of study.


Note that for using SPLICE_F_GIFT, the chunks must be page aligned, i.e. every chunk must be a contiguous block of full pages. Page size normally is 4 KiB (0x1000 in hex). So, the following example is properly page aligned:
iovec[0].iov_base==0x080ed000, iovec[0].iov_len==0x1000,
iovec[1].iov_base==0x080ef000, iovec[1].iov_len==0x2000,
iovec[2].iov_base==0x080f4000, iovec[2].iov_len==0x1000,
Note that the chunks have different sizes, but all of them are multiples of the page size (0x1000). Also, there is a 12 KiB gap between the second and the last chunk, but all chunks start at a page-aligned address (the last three hex digits are zero).
Compare instead the following example:
iovec[0].iov_base==0x080ed000, iovec[0].iov_len==0x400,
iovec[1].iov_base==0x080ed400, iovec[1].iov_len==0x400,
iovec[2].iov_base==0x080ed800, iovec[2].iov_len==0x400,
iovec[3].iov_base==0x080edc00, iovec[3].iov_len==0x400,
All four chunks form a contiguous range of 4 KiB, and each one is 1 KiB in length. Also, the first block is page-aligned, so that the four chunks together represent exactly one page in virtual memory. However, the individual chunks are not properly page-aligned: iovec[0] does not end on a page boundary, iovec[3] does not begin on a page boundary, and iovec[1] and iovec[2] neither begin nor end on a page boundary.
In short: every chunk must begin and end on a page boundary (with 4 KiB pages, a hex address ending in 000).
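The rule in the last sentence can be written as a small check. This is only a sketch: the function names are hypothetical, and the page size is passed in as a parameter (a real caller would probably take it from sysconf(_SC_PAGESIZE)) so that the check can be tried against the 0x1000 examples above on any machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper: true if the chunk begins and ends on a page boundary. */
static bool chunk_is_page_aligned(const struct iovec *chunk, size_t page_size)
{
    assert(chunk != NULL && page_size != 0);

    uintptr_t start = (uintptr_t)chunk->iov_base;

    /* An aligned start plus a whole number of pages gives an aligned end. */
    return start % page_size == 0 && chunk->iov_len % page_size == 0;
}

/* Hypothetical helper: true only if every chunk passes the check above. */
static bool all_chunks_page_aligned(const struct iovec *iov, size_t nr_segs,
                                    size_t page_size)
{
    for (size_t i = 0; i < nr_segs; i++) {
        if (!chunk_is_page_aligned(&iov[i], page_size))
            return false;
    }
    return true;
}
```

The check looks at each chunk on its own, so gaps between chunks and chunks of different sizes are accepted, as in the first example.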
Thank you very much Carsten. I will communicate it to Eduardo.
Best Regards,
D