Linux: Accessing Files With O_DIRECT | KernelTrap

2009-01-21  fort


A thread on the lkml began with a query about mounting a filesystem so that all files on it are accessed with O_DIRECT. An early white paper written by Andrea Arcangeli to describe the O_DIRECT patch before it was merged into the 2.4 kernel explains, "with O_DIRECT the kernel will do DMA directly from/to the physical memory pointed [to] by the userspace buffer passed as [a] parameter to the read/write syscalls. So there will be no CPU and memory bandwidth spent in the copies between userspace memory and kernel cache, and there will be no CPU time spent in kernel in the management of the cache (like cache lookups, per-page locks etc..)." Linux creator Linus Torvalds was quick to reply that, despite such claims, there is no good reason for ever using O_DIRECT, suggesting that interfaces like madvise() and posix_fadvise() should be used instead: "there really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary."

Linus went on to explain, "the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way. As a result, our madvise and/or posix_fadvise interfaces may not be all that strong, because people sadly don't use them that much. It's a sad example of a totally broken interface (O_DIRECT) resulting in better interfaces not getting used, and then not getting as much development effort put into them." To further underscore his point, he humorously added:

"The whole notion of "direct IO" is totally brain damaged. Just say no.
This is your brain: O
This is your brain on O_DIRECT: .

Any questions?
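
As a concrete illustration (not taken from the thread), here is a minimal sketch of the kind of page-cache control Linus is pointing at: a plain buffered read loop that declares its access pattern and then tells the kernel, via posix_fadvise(), that the cached pages will not be reused. The file name and buffer size are arbitrary.

    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        /* Hint the expected access pattern up front (readahead tuning). */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            ;                                     /* process buf here */

        /* Done with the data: ask the kernel to drop the cached pages. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return 0;
    }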

 


From: Aubrey [email blocked]
To: "Hua Zhong" [email blocked]O_
Subject: O_DIRECT question
Date:	Thu, 11 Jan 2007 10:57:06 +0800
Hi all,
Opening file with O_DIRECT flag can do the un-buffered read/write access.
So if I need un-buffered access, I have to change all of my
applications to add this flag. What's more, Some scripts like "cp
oldfile newfile" still use pagecache and buffer.
Now, my question is, is there a existing way to mount a filesystem
with O_DIRECT flag? so that I don't need to change anything in my
system. If there is no option so far, What is the right way to achieve
my purpose?
Thanks a lot.
-Aubrey

From: Linus Torvalds [email blocked]
Subject: Re: O_DIRECT question
Date: Wed, 10 Jan 2007 19:05:30 -0800 (PST)

On Thu, 11 Jan 2007, Aubrey wrote:
>
> Now, my question is, is there a existing way to mount a filesystem
> with O_DIRECT flag? so that I don't need to change anything in my
> system. If there is no option so far, What is the right way to achieve
> my purpose?

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

This is your brain: O
This is your brain on O_DIRECT: .

Any questions?

I should have fought back harder. There really is no valid reason for EVER
using O_DIRECT. You need a buffer whatever IO you do, and it might as well
be the page cache. There are better ways to control the page cache than
play games and think that a page cache isn't necessary.

So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.

Linus
From: Linus Torvalds [email blocked]
Subject: Re: O_DIRECT question
Date: Wed, 10 Jan 2007 19:15:48 -0800 (PST)

On Wed, 10 Jan 2007, Linus Torvalds wrote:
>
> So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
> instead.

Side note: the only reason O_DIRECT exists is because database people are
too used to it, because other OS's haven't had enough taste to tell them
to do it right, so they've historically hacked their OS to get out of the
way.

As a result, our madvise and/or posix_fadvise interfaces may not be all
that strong, because people sadly don't use them that much. It's a sad
example of a totally broken interface (O_DIRECT) resulting in better
interfaces not getting used, and then not getting as much development
effort put into them.

So O_DIRECT not only is a total disaster from a design standpoint (just
look at all the crap it results in), it also indirectly has hurt better
interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful
and clean interface to make sure we don't pollute memory unnecessarily
with cached pages after they are all done) ends up being a no-op ;/

Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
day we can just rip the damn disaster out.

Linus
From: Andrew Morton [email blocked]
Subject: Re: O_DIRECT question
Date: Wed, 10 Jan 2007 20:51:57 -0800

On Thu, 11 Jan 2007 10:57:06 +0800 Aubrey [email blocked] wrote:
> Hi all,
>
> Opening file with O_DIRECT flag can do the un-buffered read/write access.
> So if I need un-buffered access, I have to change all of my
> applications to add this flag. What's more, Some scripts like "cp
> oldfile newfile" still use pagecache and buffer.
> Now, my question is, is there a existing way to mount a filesystem
> with O_DIRECT flag? so that I don't need to change anything in my
> system. If there is no option so far, What is the right way to achieve
> my purpose?

Not possible, basically. O_DIRECT reads and writes must be aligned to the
device's block size (usually 512 bytes) in memory addresses, file offsets
and read/write request sizes. Very few applications will bother to do
that and will hence fail if their files are automagically opened with
O_DIRECT.
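
To make the alignment constraints concrete, here is a minimal sketch of an O_DIRECT read: the user buffer, the file offset and the request size must all be multiples of the device's block size, or the read fails with EINVAL. The 4096-byte alignment used here is an assumption (aligning to the page size satisfies any common block size); the file name comes from the command line.

    #define _GNU_SOURCE                           /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        size_t len = 4096;                        /* multiple of the block size  */
        if (posix_memalign(&buf, 4096, len))      /* aligned buffer is mandatory */
            return 1;

        ssize_t n = read(fd, buf, len);           /* offset 0 is already aligned */
        if (n < 0)
            perror("read");                       /* EINVAL on misalignment      */
        else
            printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }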




There _are_ genuine uses for O_DIRECT

January 11, 2007 - 10:42am
Anonymous (not verified)

If you are accessing a block device whose contents may change without being written through the typical kernel file I/O routines, then repeated reads will return the cached value and not the live value.
This can be seen when directly reading the LVM snapshot COW delta block device: if you read the COW block device without O_DIRECT, then write some data to the disk that causes the snapshot COW data to change, and then re-read the COW block device without O_DIRECT, you will get the _cached_ values back and not the true current data. If you open the COW device with O_DIRECT, then it all works properly.

You can argue that the COW device should be marking the caches as dirty (somehow) if it changes its contents, but it didn't as of 2.6.16, so O_DIRECT still had value at that point - indeed, some functionality is impossible without it!
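
For illustration only (this is an assumption, not something proposed in the comment above): if the goal is just to force a fresh read of a block device, one page-cache-control alternative is to ask the kernel to drop its clean cached pages with posix_fadvise() before re-reading. Whether this is sufficient depends on the pages being clean and on the kernel version; the O_DIRECT open described above sidesteps the cache entirely. The device path and sizes are placeholders.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;                             /* e.g. an LVM COW device node */

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        pread(fd, buf, sizeof buf, 0);            /* first read fills the page cache */

        /* ... the device contents change underneath the cache ... */

        /* Ask the kernel to drop its (clean) cached pages for this device,
         * then read again to see the current on-disk data.                 */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        ssize_t n = pread(fd, buf, sizeof buf, 0);
        printf("re-read %zd bytes\n", n);

        close(fd);
        return 0;
    }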

That the purpose of mmap,

January 12, 2007 - 10:04am
Anonymous (not verified)

That's the purpose of mmap, isn't it?

Uhh, no

November 14, 2007 - 5:07am

I think he means *really* change without going through the kernel at all.

Like say you have some device emulating a filesystem that's being concurrently changed/updated by another machine but the kernel isn't handling locking. I can imagine this sort of thing occurring in some kind of crazy database setup.

The real question is what sort of interface these other options offer. No one denies that O_DIRECT is better than nothing; the question is just whether it can be done even better.

Old Discussion

January 12, 2007 - 12:26pm
Peter Zaitsev (not verified)

I think I've seen the same discussion some 2-3 years ago.

From my standpoint O_DIRECT is the best thing for the task we have right now. Databases need to manage their caches themselves and bypass the OS. I'm not sure how madvise could be used to bypass the OS file cache entirely, including the extra copying, etc.

Regarding mmap, it still means the OS caches things, not the database; plus, it is not that easy to handle I/O failures.

Does a filesystem *always*

January 12, 2007 - 4:48pm
Jon E (not verified)

Does a filesystem *always* know better than an application how to talk to the underlying devices? By no means! Therefore there exists a case for O_DIRECT (clear the way .. I'm coming through!!)

BTW - I don't know how many times I've seen this discussion, and if you talk to seasoned filesystem guys (and girls) you'll find that despite good intentions and best efforts there's always a need to allow for some sort of direct I/O, since someone else may always come up with a better way of talking to a device and, face it, you might be getting in the way.

The FS has information that

January 13, 2007 - 6:09am
Anonymous (not verified)

The FS has information that the userspace program doesn't have. If you know how to "better talk to a device", then don't use a FS, but a raw partition. This discussion is only about O_DIRECT on files, not on partitions.

We still need o_direct...

January 15, 2007 - 5:52pm
jml (not verified)

I think we still need O_DIRECT. Take the case where you're looking at a cluster heartbeat (think ocfs2, for example). Reading the heartbeat sectors without O_DIRECT returns cached data per node, and thus you never see the updates. Open the device with O_DIRECT and all works well. Is there another way to ensure that you're doing actual disk reads (vs. looking at the cache)?

Sheesh... This discussion is

January 16, 2007 - 2:14pm
Anonymous (not verified)

Sheesh...

This discussion is only about O_DIRECT on FILES. It is NOT about O_DIRECT on DEVICES. What is so hard to understand about that?

I think Linus was talking

January 16, 2007 - 10:53am
Anonymous (not verified)

I think Linus was talking about shared mapping. That way your app shares buffer with kernel.

Unfortunately, even on AMD64 there are only 48 bits of address space, which seems not to be big enough for the database people. Still, there is remap_file_pages, which should work OK even on 32-bit arches.

I think there is another reason why database people love O_DIRECT: they need precise control over data writeback. It is more convenient to write() data when you need it to be written than to play tricks with mlock/munlock().

That way your app shares

January 21, 2007 - 9:28am

That way your app shares buffer with kernel.

Sounds interesting, but how is that going to work on a file system with block sizes smaller than the physical page size?

It does not (or should not

January 21, 2007 - 2:09pm
Anonymous (not verified)

It does not (or should not (depending on implementation)) depend on FS block size. It depends on block device sector size.

Yes, mmap IO is always page aligned. On small writes it will be sub-optimal, but AFAIK DBMSs don't use small blocks.

My point is, that if the

January 21, 2007 - 3:51pm

My point is that if the file system works with allocations in units smaller than one page, then you cannot map that into user space. Assuming I mmap a 4KB region of a file, that could be 3 sparse sectors and 5 other sectors scattered all over the device. How would you map this into user space?

Well, Linux support that.

January 22, 2007 - 3:56am
Anonymous (not verified)

Well, Linux supports that - probably by reading those fragments into a single page.

Removable storage is the

January 12, 2007 - 5:03pm
Anonymous (not verified)

Removable storage is the best argument for direct I/O. When I copy something to a USB hard drive or thumb drive, and the progress dialog says that the copy is done, I want to be able to just pull the plug and go on my merry way. I don't want to lose my data because I forgot to manually flush the write buffer. There shouldn't be a need to flush a buffer.

I don't care if the removable device operates slowly for lack of a cache. Written data can be cached for subsequent reads, but writes should be direct and immediate for these devices.

I don't think that has

January 12, 2007 - 5:37pm
Anonymous (not verified)

I don't think that has anything to do with O_DIRECT, which is an open() system call option for specific files that a process has to use explicitly when it opens a file. You're talking about using USB hard drives or flash memory devices in general, to store anything from any application. The best option available for that case is to mount the filesystem with the "sync" option.

sync is potentially dangerous

January 14, 2007 - 9:49am
rlj (not verified)

as far as i've understood, mounting flash memory with the 'sync' mount option (in particular FAT file systems i think) is a very bad idea, since every tiny write will be flushed out immediately (instead of being cached), resulting in hammering of the FATs which are at static positions on the partition. since flash memory has limited write cycles, the flash cells holding the FAT on the fs can quickly wear out and become permanently destroyed, resulting in a bricked memory stick (although you can possibly recreate the partition at another offset?).

i recommend reading the following thread on lkml about the issue: http://readlist.com/lists/vger.kernel.org/linux-kernel/22/111748.html

info on flash memory: http://en.wikipedia.org/wiki/Flash_memory

cheers
rlj

(Grandparent replying) Well,

January 14, 2007 - 1:57pm
Anonymous (not verified)

(Grandparent replying)

Well, you may be right. I don't mount them with the sync option, but I also don't mind waiting some seconds when I umount the partition. But if the original poster wants to umount the partition and immediately be able to unplug the drive, there are not many options...

original poster

January 15, 2007 - 3:57pm
Anonymous (not verified)

Yeah, the OP is smoking $2 crack.

No operating system allows you to unplug USB drives without doing "unmount."
No, not even Windows, where there is that little "remove" thingy you need to click on.

This request is the equivalent of "I don't want to put my car into park, but I want it to stay where it is when I get out, even if it's moving at the time. See? See?"

Two solutions

January 22, 2007 - 3:22pm
Anonymous (not verified)

Two solutions:

  1. Don't use FAT on flash devices (or anything else for that matter)
  2. Use a flash device that does wear-leveling. Most memory cards do that.

Kernel 2.6.19 adds the

January 12, 2007 - 11:01pm
Anonymous (not verified)

Kernel 2.6.19 adds the "flush" mount option, which ensures that bytes written to files on a mounted filesystem have been flushed to the device by the time close(2) returns. That's what you're looking for.

It's like the "sync" option that doesn't make you wait for every block written.

Real uses for O_DIRECT that memory mapping cannot do (well)

January 12, 2007 - 6:45pm

There are a couple real uses for O_DIRECT that I know of that memory mapping cannot do, or at least not do so well.

One simple use is zero-ing partitions. With memory mapping you have to zero out (calling memset for example) every page as it gets mapped. And the first write on that page might well read the page in to get the other 4092 bytes you haven't written yet (because the paging system might not know you are planning to store more 0 words). I have a program I use that creates 256KB of all binary zeros, opens the device for write with O_DIRECT, and proceeds to write that same 256KB over and over up to the end of the partition.

Another more complex use involves the virtual ring buffer. That is a special ring buffer implemented as a group of pages of virtual memory with a mirror image of it immediately following it. It's an effective way to read and/or write data in a ring buffer without having to worry about the wraparound or spend the time copying tail data back to the head. Direct access to all data is always available contiguously. Once you have data that is to be written from the virtual ring buffer, you may not want to copy the 4096 bytes over to a memory mapped space when O_DIRECT could allow it to be DMA'd directly (as long as what you're writing is an exact page unit or multiple thereof). Same for reading.

More info on my virtual ring buffer implementation is at http://vrb.slashusr.org/.
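
Below is a self-contained sketch of the double-mapping idea (this is not the poster's VRB code; see the link above for that). The same temporary file is mapped twice, back to back, so an access that runs past the end of the first mapping lands back at the start of the buffer. The temp-file path and one-page size are placeholders, and error handling is abbreviated.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map one file-backed region twice, contiguously, so indexes
     * [0, size) and [size, 2*size) alias the same memory.          */
    static char *vrb_create(size_t size)      /* size: multiple of the page size */
    {
        char path[] = "/tmp/vrb-XXXXXX";
        int fd = mkstemp(path);
        if (fd < 0)
            return NULL;
        unlink(path);                         /* no name left on disk */
        if (ftruncate(fd, (off_t)size) < 0) { close(fd); return NULL; }

        /* Reserve 2*size of contiguous address space, then overlay the
         * same file twice with MAP_FIXED.                               */
        char *buf = mmap(NULL, 2 * size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { close(fd); return NULL; }
        mmap(buf, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(buf + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);
        return buf;
    }

    int main(void)
    {
        size_t size = (size_t)sysconf(_SC_PAGESIZE);
        char *rb = vrb_create(size);
        if (!rb)
            return 1;

        /* A write that crosses the end of the buffer wraps around. */
        strcpy(rb + size - 3, "wrap");
        printf("%s\n", rb + size - 3);        /* prints "wrap"                 */
        printf("%c\n", rb[0]);                /* prints 'p' - the wrapped byte */
        return 0;
    }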

It's invalid to compare mmap with direct I/O

January 12, 2007 - 8:31pm

If you use direct I/O, the API you get is the read/write one. So comparing with mmap behaviour is not very useful.
If you have to zero a block device, the simple solution to the misbehaviour you identify is to simply call write() - it'll be the same thing as calling write() with direct I/O enabled, except that it allows the caches to work better.
Well, the only thing left to choose, then, is DMA vs. caches, and that's a good point - if that was the reason for your program, it's a good one, just badly explained.

Also, a simpler (and faster) way to create an area of binary zeros is to mmap /dev/zero or do an anonymous mapping, possibly read-only, and pass the returned address to write() to get it down on the disk. Mmapping /dev/zero also has the advantage that you read from a single global shared page (shared at kernel level).

I also appreciated your VRB idea, thanks for posting about it!
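
A minimal sketch of the zero-filling approach suggested above, using an anonymous read-only mapping (equivalent to mapping /dev/zero) as the source buffer. The device path, chunk size and lack of error recovery are placeholders for illustration; this is not code from the thread.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK (256 * 1024)

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;                             /* e.g. /dev/sdXN */

        int fd = open(argv[1], O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Anonymous, read-only mapping: every page is the kernel's shared
         * zero page, so no memory is touched to build the source buffer.  */
        void *zeros = mmap(NULL, CHUNK, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (zeros == MAP_FAILED) { perror("mmap"); return 1; }

        /* Write the same zero-filled chunk until the device is full
         * (write() returns a short count or fails with ENOSPC).     */
        while (write(fd, zeros, CHUNK) == CHUNK)
            ;

        fsync(fd);                                /* push it out of the page cache */
        close(fd);
        return 0;
    }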

O_DIRECT and preventing data integrity

January 17, 2007 - 6:41pm
Anonymous (not verified)

OK guys,

What if keeping data integrity on storage is crucial (e.g. in case of power failure or a system crash)?

You have a choice between O_SYNC and O_DIRECT.

O_SYNC - works, but flushing a large amount of cache may be very time consuming.

O_DIRECT - works, and since it bypasses kernel buffers the flushing bottleneck does not appear: no buffers, no flushing, no delays.

So, is there any alternative to O_DIRECT in such a case?

man fsync?

January 18, 2007 - 3:31am
Anonymous (not verified)

man fsync?

fsync is not enough

January 18, 2007 - 7:45am
Anonymous (not verified)

man fsync:

In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync() / fdatasync() return.

This is different from

January 18, 2007 - 2:08pm
Anonymous (not verified)

This is different from O_DIRECT?

yes, it is different

January 22, 2007 - 7:50am
Anonymous (not verified)

It is different, because information about the order of write operations is lost when data goes to disk through a write-back cache, even with fsync.

You seem to be confused.

January 23, 2007 - 3:35am
Anonymous (not verified)

You seem to be confused. This subthread was not about write ordering, but about being sure that the data hit the platter. Since hd write buffering is a layer below the OS it doesn't make a difference whether you use O_DIRECT or fsync().

Theoretically Matters

November 14, 2007 - 5:17am

Suppose you have a disk implementing hardware write buffering. It's not unreasonable to suppose that the disk makes some guarantees about data reaching the actual platters, particularly that if write A hits the hardware before write B then a power-down will never result in a disk where B has been written but A has not. Presumably one could guarantee this by having capacitors in the drive guaranteeing it has enough power to finish any sequence of writes it reorders (or through some complex sector-remapping algorithm).

In such a situation it could make a lot of difference if the kernel reorders the writes before an fsync. Of course you could avoid this problem by making the disk sync after every single write, but unless you can sync just your own data (and not the whole buffer) it would be a performance hit.

No idea if this is an issue in practice.

After the data is sent out

January 28, 2007 - 3:07am
Anonymous (not verified)

After the data is sent out to the hard disk, the kernel has nothing more to do with it. None of Linus' minions can save you if your hard disk decides to store data on the on-board RAM chips, and then dies before it can write it out.

My impression is that database folks are more concerned about failures in other parts of the computer system. For example, we'd like to have a consistent database on the drive even after a kernel panic.

But if the drive itself is hosed, then yeah... you are probably going to lose the data on that drive, unless you're using RAID or something.

After the data is sent out

July 21, 2007 - 4:55pm
Anonymous (not verified)

After the data is sent out to the hard disk, the kernel has nothing more to do with it. None of Linus' minions can save you if your hard disk decides to store data on the on-board RAM chips, and then dies before it can write it out.

That's not necessarily true. The kernel can send a transport-specific command to tell the drive to flush its cache, thereby ensuring that the data is actually on the platters.

Performance

February 1, 2007 - 1:22pm
Anonymous (not verified)

Last time I checked, I could not get high performance with O_DIRECT. I'm talking about RAID arrays that can do over 1 GB/s while using a filesystem. The typical case is uncompressed video as used in the post-production business (4K 10-bit at 24 fps is 4000*3000*(3*10/8)*24 = 1 GB/s of sustained throughput). Last time I checked, using the page cache will just completely kill the performance. And this is single-use data; the simple case is DMA in from disks, DMA out to projector (or DMA in from film scanner, DMA out to disk). Caching in this case becomes a serious annoyance (since it adds a useless memory copy).

Write behind

February 15, 2007 - 3:45am
Topi (not verified)

Hi,

Here's a comparison of three different programming methods.

1. The system has a true DMA-capable HD (possibly a high-performance RAID array).
2. A user-mode application generates a huge amount of data to be written to disk.
2.1. Data is written in K-sized blocks (e.g. 100 megs).
2.2. The application doesn't need the data (in memory or in the file) after it has sent it to the kernel.
3. The application would benefit if write-behind caching is implemented.
4. Three ways to see this would be:
4.1. fadvise way:

 buf = allocate_K_size_buffer();
 for() {
  generate_data_K_size(buf);
  write(fd,buf,K); // Copy point 1
  posix_fadvise(fd,pos,K,POSIX_FADV_DONTNEED); // To prevent unneeded caching
 }

4.2. O_DIRECT way:

 open(...O_DIRECT);
 ...
 buf = allocate_K_size_buffer();
 for() {
  generate_data_K_size(buf); // Block point 2
  write(fd,buf,K); // Block point 1
 }

4.3. third way:

 for() {
  buf = allocate_K_size_buffer();
  generate_data_K_size(buf);
  write_and_free_memory_syscall(fd,buf,K);
  posix_fadvise(fd,pos,K,POSIX_FADV_DONTNEED);
 }

These yield different performance characteristics.

In 4.1 the kernel has to copy (Copy point 1) the data from buf to an internal buffer, which is possibly freed after the actual write. The copy takes time and consumes (possibly unnecessary) memory.

In 4.2 the application is blocked either at "Block point 1", or the next time buf is dirtied (Block point 2).

In 4.3 the application tells the kernel to take ownership of the memory buffer (buf), write it to disk, and free it afterwards (from the application's point of view), giving the kernel the opportunity to use buf as cache data. The fadvise call tells the kernel to forget the data afterwards. No memory copying is needed, and no systematic blocking happens.

Is there a function, or set of functions, that works the 4.3 way?

--
Topi
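
For what it's worth, here is a runnable sketch of the 4.1 approach above, with one detail the pseudocode glosses over: POSIX_FADV_DONTNEED cannot drop pages that are still dirty, so write-back is kicked off first with the Linux-specific sync_file_range() (available since 2.6.17). The file name and sizes are placeholders, and this is not the write_and_free_memory_syscall() the comment asks for.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define K (4 * 1024 * 1024)                   /* size of one block        */
    #define N 16                                  /* number of blocks (64 MB) */

    int main(void)
    {
        int fd = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        char *buf = malloc(K);
        if (!buf)
            return 1;

        for (int i = 0; i < N; i++) {
            memset(buf, i, K);                    /* "generate" a block of data */
            off_t pos = (off_t)i * K;

            if (write(fd, buf, K) != K)           /* copy into the page cache   */
                return 1;

            /* Start write-back for this block, then tell the kernel the
             * cached copy will not be needed again.                     */
            sync_file_range(fd, pos, K, SYNC_FILE_RANGE_WRITE);
            posix_fadvise(fd, pos, K, POSIX_FADV_DONTNEED);
        }

        close(fd);
        free(buf);
        return 0;
    }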

Write behind - posix_fadvise vs O_DIRECT with DMA'd MMAP buffers

May 2, 2007 - 10:43am
ZBDeebs (not verified)

I would benefit greatly from knowing which way is best. I have tried to use posix_fadvise(fd,pos,size,POSIX_FADV_DONTNEED) and, in another instance, O_DIRECT, but both still use the cache unnecessarily.

I have a specific case where I have a device driver with multiple pci_alloc_consistent buffers that it has mapped into user space through mmap. My user app maps those buffers (each 4MB) and then needs to repeatedly call a file write to send the data over to a striped RAID fully capable of throughputs above 200MB/s, but I only get 70-90 MB/s. I don't need to look at the data I just accessed in the mmap'd buffer or the file. BUT the mmap'd buffer will get accessed again, with new data. Maybe there is a way to tell that the cache is dirty and needs to be refreshed or just forgotten - maybe that's what we're trying to do through the posix_fadvise. I have tried posix_madvise, but I receive errors from the call, I suspect mostly because the mmap'd buffers are PCI-addressable RAM from pci_alloc_consistent.

My first implementation is a simple write from my mmap'd buffer to the file through a write call.

write(fd,mmap_buffer[i],4*MB);

Or, I use O_DIRECT to open the file, create a page-aligned 4MB buffer with posix_memalign, and then call...

memcpy(malign_buf,mmap_buffer[i],4*MB);
write(fd,malign_buf,4*MB);

I have not seen any performance boost from using posix_fadvise or even O_DIRECT.

Avoiding copies by the CPU

April 22, 2008 - 7:09pm
Miles (not verified)

Thank you!!! This is similar to the scenario I have. In my case high throughput isn't the issue/requirement -- cpu conservation is. In my case I have an embedded system with a piece of hardware supplying a never-ending (for all intents and purposes) stream of data. The CPU will never touch any of the incoming data -- it just needs to go to disk (a file). (Or vice versa... a never ending stream of data needs to come off the disk and get DMA'ed to an output device)

Linus religiously poo-poos O_DIRECT, but how else can you achieve DMA-in -> DMA-out through user space without incurring data movement by the processor? Anything short of O_DIRECT will leave you with a copy_to/from_user.

O_DIRECT and Kernel Cache / Code

March 2, 2007 - 9:59am
George Presura (not verified)

The advantages of using O_DIRECT become obvious when:

  1. an application opens a file exclusively for I/O operations (no one else opens that file)
  2. the application keeps its own cache (maybe mlock()ed)
  3. the application performs a lot of I/O operations
  4. the application cannot use raw partitions (very important)

This is because:

  1. DMA transfers are used
  2. the kernel's page-cache code is skipped
  3. the application can use 2 threads, one for reading data and one for writing data

Thus, the I/O wait time is minimal and such a system would perform almost like a real-time one.
Also, using O_DIRECT the application can implement a better algorithm than the Linux elevator, and can write data to disk at intervals of its own choosing rather than at /proc/sys/vm/dirty_writeback_centisecs (hours or days can pass before data is written to disk - of course this applies to systems where you do not worry about power failures, hdd crashes, etc.)

Such applications are databases, log systems, etc.

Another scenario: imagine an application performing a lot of writes per second (as many as possible, thousands or tens of thousands) and also performing a lot of reads, all from about the same zone of the file, within a few tens or hundreds of megabytes. This application can store all that data in RAM, writing it to disk only when it has contiguous zones big enough that it will write tens of megabytes at once, using the full sequential write speed of the disk. The pdflush daemon cannot guarantee this, right?

My question is: using madvise() and posix_fadvise() makes programming a little harder and still executes (a lot of) code in kernel mode. Is there a way to meet all the requirements above using anything other than O_DIRECT?
