I've played with the implementation of fwrite, and had several different
implementations to compare them:

1) "for (i=0; i<nbytes; ++i) putc(*ptr++, stream);"  (roughly).  This is what
current C libraries up to and including 5.26 do.

2) align write buffer using (1) or (4); OS_GBPB blocks_to_write * block_size
bytes in one go; write remainder using (1) or (4)

3) as (2), but call OS_GBPB to write block_size bytes blocks_to_write times.

In addition, some alternative strategies to do the buffer alignment:

4) if (space_in_write_buffer >= nbytes) memcpy the data into the write buffer
   and exit (note, you don't flush the buffer until the next write is done)
   else memcpy space_in_write_buffer bytes into the write buffer;
   then force the buffer to be flushed.

When writing the remainder in (2) or (3), the first condition in (4) will
always be true.

Choice between methods (2) and (3) is controlled by the WRITE_ONCE macro
in stdio.c.

Methods (2) and (3) avoid the bytewise data copy from the caller's buffer
into the stdio buffer.

It's always going to be necessary to align the write buffer before attempting
the block write because there might be unflushed data in the write cache
already or this might be the first write in which case the internal data
structures are not set up to do writing.

The test was to write an 8M file to ADFS, NFS, ATAFS, SCSIFS and RAMFS. 
Times in seconds. The block size is the parameter to fwrite used in the test
program.  Times varied up to around 2%, using StrongARM RiscPC 200MHz, our
RISC OS 4 builds.


Strategy    Block size      ADFS        NFS      RAMFS    ATAFS

   1        5K              7.50        20.68    1.87     7.16
            32K             7.51        20.76    1.94     7.32
            8M              7.47        20.85    1.95     7.42

   2        5K              7.44        19.89    0.76     6.40
            32K             7.46        14.39    0.62     3.40
            8M              7.45        10.29    0.52     2.40
   
   3        5K              7.44        20.29    0.78     6.40
            32K             7.47        19.35    0.70     6.10
            8M              7.47        18.92    0.63     6.09
   

We tried SCSIFS on an A540, and the timings are of course completely
different from my StrongARM RiscPCs - but the relative stats were that
strategy 3 was twice as fast as strategy 1, and strategy 2 was twice as fast
3 for 5K blocks, but 9-10 times as fast for 32K and 8M blocks!  We also tried
SCSIFS on a StrongARM Risc PC and found that 32K or larger blocks in strategy
(2) made a significant difference, but (3) didn't and neither did (2) at 5K.

Using or not using (4) didn't make any measurable difference.  These are all
word-aligned writes.

Misaligning the stdio buffer but writing a few bytes before the large write
made no measurable difference.

The parallelism exploit in the NFS module where multiple transactions can be
run in parallel when >8K of data is to be transferred in a single call
through the filing system entry point accounts for the variation in NFS;
RAMFS benefits a great deal from removing the extra memory copy.  The bulk
transfer really helps SCSIFS - but that may be down to the architecture of an
A540.  Any or all changes seem to make no difference to ADFS
(ADFSBuffers didn't affect the speed measurably).

The major downside is that you lose TaskWindow multi-tasking during writes of
large blocks.  If applications used setvbuf to set up a 32K buffer, then they
should benefit quite a bit - particularly on non-Risc PC IDE bus bound filing
systems.

-- 
Stewart Brodie, Senior Software Engineer
Pace Micro Technology PLC
645 Newmarket Road
Cambridge, CB5 8PB, United Kingdom         WWW: http://www.pacemicro.com/