synchronize a file's state with that on disk

Hello Dev,

The app I am working on uses fcopyfile() to copy files from external storage (USB, Thunderbolt) to the internal disk. This call appears to provide the best performance.

fcopyfile(from, to, nullptr, COPYFILE_DATA);

I need to be sure the data has been copied once the call returns. Is it necessary to synchronize the file's state with that on disk afterwards (e.g. by calling fsync())?

Thank you in advance!

Pavel

Answered by DTS Engineer in 829540022

The app I am working on uses fcopyfile() to copy files from external storage (USB, Thunderbolt) to the internal disk. This call appears to provide the best performance.

The term "best performance" in this context is always a bit concerning because there's a very common assumption that "lower level" inherently means "faster"/"better"/etc. That is NOT inherently true, and historically it definitely was NOT the case. There were MULTIPLE versions of macOS/iOS where copyfile was measurably slower than copying through NSFileManager (primarily due to a large performance gap between fts() and the underlying implementation of NSFileEnumerator).

Their performance today is basically identical; however, that's because copyfile/fts were improved to match NSFileManager's implementation and NSFileManager was then shifted onto copyfile().

Here is the right way to understand file copy and performance:

  • Single file copy performance tends to be fairly close across most copy implementations. It's the simplest case and doesn't provide very much opportunity for optimization.

  • The main reason to use copyfile is that it provides better progress support and detailed controls than NSFileManager (or other copy APIs), not because it's faster.

  • copyfile performs very well with limited memory and thread impact. However, it is certainly possible to write a faster copy routine IF you're willing to consume more memory and use multiple threads. This is the primary reason the Finder is faster than copyfile (once you remove preflighting time).

  • I STRONGLY recommend against implementing your own copy function. The issue here is that writing a "correct" copy function is much harder than it seems, primarily due to a very large number of details and undocumented edge cases. File copying is a problem that is trivial to describe in a basic way ("just copy the file!") and much more difficult to actually describe in detail ("What EXACTLY should be moved and preserved and what should not be...").

I need to be sure the data has been copied once the call returns. Is it necessary to synchronize the file's state with that on disk afterwards (e.g. by calling fsync())?

This depends on exactly what you actually want/need. In terms of normal system functionality, no extra work is necessary. The UBC (Universal Buffer Cache) ensures that the system maintains a coherent view of storage, so any accesses to the newly created file will be handled by the newly cached data, even if that data has not yet been flushed to disk. The unmount process will then ensure that data is properly flushed (which is why unmounting exists at all).

However, if you're concerned about edge cases like being able to immediately cut power after the copy "finishes", then, yes, additional effort is required. You mentioned fsync, but what's actually required is what's described in the fsync man page:

"For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail."

Looking at the fcntl man page, "F_FULLFSYNC" is described as:

"Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). As this drains the entire queue of the device and acts as a barrier, data that had been fsync'd on the same device before is guaranteed to be persisted when this call returns. This is currently implemented on HFS, MS-DOS (FAT), Universal Disk Format (UDF) and APFS file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data."

That call does "everything" in the system's power to ensure that all data has ACTUALLY been written to persistent storage. It's entirely possible that poorly implemented hardware can still cause data loss, but there's nothing else the system can do about that.
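
As an illustration of the F_FULLFSYNC call described above, here is a minimal sketch of a flush helper. The function name flush_to_permanent_storage is hypothetical, and the fallback to plain fsync() when F_FULLFSYNC fails or is unavailable is an assumption made for this sketch, not something the quoted man pages prescribe.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (name chosen for illustration): asks the system to
// push everything written to `fd` down to permanent storage.
// Returns 0 on success, -1 on failure (errno is set by the failing call).
int flush_to_permanent_storage(int fd) {
#ifdef F_FULLFSYNC
    // macOS: does what fsync() does, then asks the drive to flush its own
    // buffers. The third fcntl argument is ignored for this command.
    if (fcntl(fd, F_FULLFSYNC, 0) == 0) {
        return 0;
    }
    // Assumption: if the file system rejects F_FULLFSYNC, a plain fsync()
    // is still better than nothing, so fall through to it.
#endif
    // Weaker guarantee: data is handed to the drive, which may still hold
    // it in its own cache.
    return fsync(fd);
}
```

In the scenario from the question, this would be called on the destination descriptor after fcopyfile(from, to, nullptr, COPYFILE_DATA) returns 0 and before that descriptor is closed. Since the flush may take quite a while to complete, it is probably best done once per copied file rather than repeatedly during the copy.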

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you very much for the response!
