Commit 3b1e56ec authored by Stewart Brodie

Import from SrcFiler of Browser fetchers

*,ffb gitlab-language=bbcbasic linguist-language=bbcbasic linguist-detectable=true
c/** gitlab-language=c linguist-language=c linguist-detectable=true
h/** gitlab-language=c linguist-language=c linguist-detectable=true
cmhg/** gitlab-language=cmhg linguist-language=cmhg linguist-detectable=true
*,fe1 gitlab-language=make linguist-language=make linguist-detectable=true
***************************************************************************
* *
* Project: Networking *
* *
* Module: HTTP *
* *
* Created: Wed 06-Mar-96 By: Kevin Bracey *
* *
* First version: 0.22 *
* *
* Copyright: (C) 1996, Acorn Computers Ltd., Cambridge, England. *
* *
***************************************************************************
Purpose:
========
HTTP fetcher for URL module.
***************************************************************************
Change Log:
===========
---------------------------------------------------------------------------
Version: 0.22 Wed 06-Mar-96 Kevin Bracey
Checked in to source filer.
---------------------------------------------------------------------------
Version: 0.24 Thu 05-Jun-97 Carl Elkins
---------------------------------------------------------------------------
Version: 0.25 Thu 05-Jun-97 Carl Elkins
*Should* be version 0.24, but can't persuade SrcFiler of that !
Added code to cope with additional error return states from socket calls,
also adding an extra 'error' state (it *is* an error, but it's not
necessarily fatal, so isn't returned as an error).
Code will now trap EWOULDBLOCK, EHOSTUNREACH, ENETUNREACH and return state
'aborted before connection established' (64) to client.
---------------------------------------------------------------------------
Version: 0.30 Tue 29-Jul-97 Phillip Temple
This version not checked in.
---------------------------------------------------------------------------
Version: 0.31 Tue 29-Jul-97 Phillip Temple
Now copes with Netscape cookies, RFC2109 cookies and the new proposed
cookie2 protocol. Has 2 file formats, both Fresco compatible.
Added a couple of ENV variables:
"Browse$CookiePath" points to the file cookies are cached in
"Browse$FakeNetscape on" to spoof Netscape header
"Browse$AcceptAllCookies off" to force the browser to individually
accept each cookie rather than accept all unconditionally
"Browse$CookieFileFormat <n>" where 'n' is 1 or 2.
File format 1: basically reconstructs the SetCookie header and
writes it to file. Advantages - parsing already in place, totally
flexible, altering parsing auto-updates reading in of file
File format 2: inflexible, and totally RFC-uncompliant, file-format
devised for Fresco with no apparent logic. Cannot cope with more
information than the basic Netscape cookie protocol provides.
Handles cookie expiry. Very robust parsing: can accept cookies
split over diff header lines, broken quotes, etc.
Added a few SWI calls:
SWI "HTTP_EnumerateCookies" will generate a list of cookies either waiting
in a queue to be accepted, or cookies already in memory.
SWI "HTTP_ConsumeCookie" will take a cookie from the queue and either
accept or reject it.
SWI "HTTP_AddCookie" allows you to directly insert a cookie into
memory. Needed to implement JavaScript.
---------------------------------------------------------------------------
Version: 0.32 Tue 29-Jul-97 Carl Elkins
Minor change to connect sources to ensure that the case of being unable to
resolve a hostname doesn't 'drop through the net' - takes care of correctly
faulting a first fetch of a bad hostname.
Cannot, however, get these sources to compile to a reliable binary, I suspect
that the problem lies in the cookie/header parsing routines from the debug
output and failure point. Therefore, I'm checking this back in to pass
control back to P.Temple, for further work on the cookie/header code.
---------------------------------------------------------------------------
Version: 0.33 Fri 01-Aug-97 Phillip Temple
Couple of bugs in cookie cache reading and writing fixed. Extra traps
solved reliability problems. Still to do: cookies work in local time
rather than GMT.
---------------------------------------------------------------------------
Version: 0.34 Tue 05-Aug-97 Phillip Temple
Build same as 0.33. Documentation updated to include HTTP_UserAgent,
which was added as a fix to allow the browser to fake Netscape whilst
including its own version details.
---------------------------------------------------------------------------
Version: 0.35 Wed 06-Aug-97 Andrew Hodgkinson
A slightly traumatic check-in (SrcFiler crashed half way through), but
got there in the end - bug fix to the finalisation code (it didn't
free a malloc block from the initialisation code), plus tidied up the
module (as in SrcFiler module). Module version number incremented to
0.35 as the bug fix is significant and it brings the SrcFiler versions
back into line.
---------------------------------------------------------------------------
Version: 0.36 Wed 17-Sep-97 Carl Elkins
Source not checked in, but forces version numbers into line.
---------------------------------------------------------------------------
Version: 0.37 Wed 17-Sep-97 Carl Elkins
Added generation of host: headers, allowed user agent to be specified on a
per session basis, added returning of content length information as per URL
spec. version 0.11.
Only claims HTTP 1.0 compliance, though, not 1.1, which might screw up cookie
support - unfortunately, much more work is required if we are to be 1.1
compliant to the required extent.
Modified build system used for module so as to eliminate duplication of
version information in headers/source files (needs cmhg 5.16 or later).
Fixed some bugs, but suspect that there's still a memory leak in the
cookie/header parser ; ideally, timer should be pulled off tickerv, too.
---------------------------------------------------------------------------
Version: 0.38 Wed 17-Sep-97 Carl Elkins
Screwed up the entry register assignments to HTTP_UserAgent in 0.37 -oops ;
now uses similar assignments to other SWIs.
---------------------------------------------------------------------------
Version: 0.39 Fri 19-Sep-97 Carl Elkins
Apparently fixed the interaction with !Fresco problem, by removing a bit of
code that's been there since 0.22 (strange).
Tried to improve the status returning, but it doesn't map well onto the URL
defined status bits :-(
---------------------------------------------------------------------------
Version: 0.40 Tue 30-Sep-97 Carl Elkins
Source not checked in, however only changes were minor ones to build files to
further tidy the build process wrt duplication of information, and tidying of
code layout to improve readability (IMHO).
---------------------------------------------------------------------------
Version: 0.41 Tue 30-Sep-97 Carl Elkins
Removed timer from TickerV, replacing with a callevery/callback pairing with
semaphores to address reentrancy - this should lighten the machine load as
well as being safer at module termination etc.
Also modified http_start call to support client specification of a 'User
Agent' string.
---------------------------------------------------------------------------
Version: 0.42 Fri 03-Oct-97 Carl Elkins
Minor changes to SWI dispatcher, also reorganised SWI table, to allow for
future expansion of both generic and HTTP specific SWIs.
---------------------------------------------------------------------------
Version: 0.43 Wed 08-Oct-97 Stewart Brodie
Minor change to internal state table to insert a non-block DNS resolver
phase.
---------------------------------------------------------------------------
Version: 0.44 Thu 09-Oct-97 Stewart Brodie
0.44
----
Host header is now sent correctly even when proxying. Before, the Host
header being sent was that of the proxy which was being used(!)
Cookie code removed to new source file and now compiles cleanly.
Header parser problems fixed - it doesn't think that the hours and minutes
are headers because they are followed by colons any more! Still can't cope
with continuation lines like it should though.
User-Agent string is now filtered for non-ASCII non-printing characters
to stop some web servers from barfing on them. (Notably www.hotbot.com)
DNS lookup code detects the lack of the Resolver module and falls back
to previous behaviour if it wasn't found (requirement for Argo)
---------------------------------------------------------------------------
Version: 0.45 Tue 14-Oct-97 Stewart Brodie
Looks up port numbers with getservbyname instead of relying on hardwired
codes.
Lots of dead code removed that was never called. Mostly due to state
table fixes.
Fixed all the holes in the state table and removed lots of code that attempted
to perform dodgy "optimisations" which would rarely, if ever, be used.
---------------------------------------------------------------------------
Version: 0.46 Wed 15-Oct-97 Stewart Brodie
Many memory leaks fixed.
Module no longer sits on callevery to check the progress of fetches with a view
to timing out ones that haven't progressed for whatever reason for a given
delay (too short to be useful). The Internet module already implements
timeouts anyway.
---------------------------------------------------------------------------
Version: 0.47 Wed 15-Oct-97 Stewart Brodie
---------------------------------------------------------------------------
Version: 0.48 Wed 15-Oct-97 Stewart Brodie
Forgot to change the CMHG header file - sorry
---------------------------------------------------------------------------
Version: 0.49 Fri 17-Oct-97 Stewart Brodie
Timeouts removed completely - I'd left in an internal timeout in the
ReadData SWI which could still have been triggered occasionally
This version does cookies by default and contains all my bug fixes for
it.
---------------------------------------------------------------------------
Version: 0.50 Tue 21-Oct-97 Stewart Brodie
Module now rewrites the headers to make them conform more closely to the
strict format laid down in the HTTP specifications. This should help any
clients parsing HTTP responses since they shouldn't find anything too wild
now. It rejoins continuation lines, and strips extraneous spaces.
This is the first version that hasn't passed a verbatim copy of the server
data to the client. Future versions will not do so either, unless the
SWI specification is extended to allow verbatim sends and receives.
Complies with "Acorn URL fetcher API specification" version 0.16. Supports
previous behaviour for backward compatibility too though.
---------------------------------------------------------------------------
Version: 0.51 Tue 21-Oct-97 Stewart Brodie
Major fix to cookie code (froze machine if Browse$CookieFile was unset).
Bug affected all versions back to around 0.43. Luckily, AW97 CD release of
!Browse does set this variable.
This version will send GET and HEAD requests using HTTP/1.1. This version
is believed to be at least "conditionally compliant" with RFC2068 (HTTP/1.1
specification). It does not do persistent connections (not a requirement of
HTTP/1.1) but does do chunked transfers (a requirement).
---------------------------------------------------------------------------
Version: 0.52 Thu 23-Oct-97 Stewart Brodie
Port number is only appended to Host headers when it is not the default
HTTP port (80).
Connection header is parsed and other headers marked as connection specific
are stripped before the client sees them (requirement of RFC2068).
---------------------------------------------------------------------------
Version: 0.53 Thu 23-Oct-97 Stewart Brodie
Headers don't get repeated in transaction back to client.
---------------------------------------------------------------------------
Version: 0.54 Thu 23-Oct-97 Stewart Brodie
Module help string modified to remove comment.
Specific release version for Acorn World 97 patch floppy disc.
---------------------------------------------------------------------------
Version: 0.55 Wed 29-Oct-97 Stewart Brodie
Number of pending queued cookies is now limited to 100 to avoid memory
problems when used with clients that don't understand cookies.
Receiving of chunked data optimised to allow client buffer to be filled
as far as possible instead of giving up at the end of a chunk.
Fixed minor memory leak on proxied requests
Cookie code no longer munges all the headers. This could prevent
authentication working usefully in earlier versions. :-(
Error messages are now stored in ResourceFS.
Potential register corruption on exit from ReadData SWI fixed (caused
browser to not be able to display percentage completions on some downloads)
Now uses a veneer to realloc to bypass bug in RISC OS 3.1 C library if
running under RISC OS 3.1.
Client-supplied headers and data are examined and potentially rewritten or
dropped from the request IFF they compromise the integrity of the request.
Module no longer counts the HTTP response header as part of the total
expected length of the data transfer. This bug was not noticed before as it
is extremely hard to reproduce (needs very specific network conditions).
---------------------------------------------------------------------------
Version: 0.56 Fri 07-Nov-97 Stewart Brodie
Minor change to remove an optimisation which is suspected as being the root
cause of the Browser Data Abort failures due to a nasty buffer overwrite in
HTTP.
---------------------------------------------------------------------------
Version: 0.57 Wed 12-Nov-97 Stewart Brodie
Generic URL parser code integrated into module and proxy detection improved,
which should allow counters to work when they include the page URLs within
their own URL.
Loads the URL module if it fails to register with it on startup. Responds
to URL 0.18's service calls.
Client's User-Agent suffixed by HTTP module version number.
---------------------------------------------------------------------------
Version: 0.58 Wed 19-Nov-97 Stewart Brodie
Cookie code no longer overwrites zero page when it is adding multiple
cookies to request headers.
Cookie parser tightened up to be more careful about date parsing.
---------------------------------------------------------------------------
Version: 0.59 Tue 25-Nov-97 Stewart Brodie
Reading non-HTTP responses breaks down if no end-of-line characters are
ever discovered in downloaded data. Fixes PAN-00709, PAN-00725
---------------------------------------------------------------------------
Version: 0.60 Thu 27-Nov-97 Stewart Brodie
This version relies on a minimum of URL 0.20 (for URL_ParseURL). It also
won't append its identifier to the User-Agent if it is already in there.
The usage of URL_ParseURL to do all its URL processing is the major new
feature though.
---------------------------------------------------------------------------
Version: 0.61 Mon 01-Dec-97 Stewart Brodie
Minor alteration to prevent multiple Cookie2 headers being added to
transactions.
Internal code change to move c.dns, c.connect and c.stop to be the Generic
versions to aid maintenance.
---------------------------------------------------------------------------
Version: 0.62 Tue 02-Dec-97 Stewart Brodie
Header parser alteration to detect badly broken HTTP servers which do not use
consistent EOL sequences. This confused the header parser which was getting
stuck waiting for possible continuation lines to arrive. This was only
really confusing if there is no entity body included in the response.
The fix need not be applied to any other of the software containing the
header parsing code, as HTTP will generally have "got" to it first and
corrected it.
---------------------------------------------------------------------------
Version: 0.64 Thu 04-Dec-97 Stewart Brodie
(version 0.63 never checked in - test version)
Contains fix for talking to badly broken NCSA 1.5.1 servers (they send
illegal responses which earlier versions of HTTP and the browser do not
recognise). Again, only HTTP needs the fix because it will correct the
fault before regenerating the headers when returning them to its clients.
Relinked with version 5.06 of the linker (solves ReInit data aborts)
Specific build for December 4 development release.
---------------------------------------------------------------------------
Version: 0.66 Wed 17-Dec-97 Stewart Brodie
(version 0.65 never checked in - test version)
This version fixes a bug which causes a zero byte to be written to memory
location zero when a protocol other than HTTP is being proxied.
Specific build for December 18 development release.
---------------------------------------------------------------------------
Version: 0.67 Wed 07-Jan-98 Stewart Brodie
Date parsing code completely replaced with my own code. This works (but may
expose a bug in the SharedCLibrary that gets confused by 2100 not being a
leap year). Cookie path validation fixed (would have become broken in 0.60
when URL parser code was introduced which didn't have a leading / on the
path).
---------------------------------------------------------------------------
Version: 0.68 Thu 08-Jan-98 Stewart Brodie
Never checked in - superseded by 0.69 within minutes of "release".
0.68 and later use the new proxying method whereby the proxy to use is
passed in R7 (when R0:31 is set). This simplifies a lot of code (a lot
could be deleted) and removes problem of trying to detect when URL wants
us to proxy something. Requires URL 0.28.
---------------------------------------------------------------------------
Version: 0.69 Thu 08-Jan-98 Stewart Brodie
The cause of the mysterious "ffb" strings has now been determined - it was
the header detector wrongly trying to reconstruct header lines while it
was parsing the chunk size. This is the only change from version 0.68.
---------------------------------------------------------------------------
Version: 0.70 Tue 13-Jan-98 Stewart Brodie
Never checked in - superseded by 0.71 very quickly due to major fault
being discovered.
---------------------------------------------------------------------------
Version: 0.71 Thu 15-Jan-98 Stewart Brodie
Module traps failure to connect much more reliably, and issues the correct
error messages when connection failure/timeout occurs.
---------------------------------------------------------------------------
Version: 0.72 Thu 29-Jan-98 Stewart Brodie
Cookie path comparison fault fixed. Now passes the test team's cookie
tests securely. (It fails one test due to an erroneous "expected result"
declaration in the test plan)
Implementation Details
======================
Author: Stewart Brodie
Date: 29th October '97
1. Introduction
2. "Session.headers"
2.1 During Requests
2.2 During Responses
3. Reading data
4. Chunked data transfers
5. Data buffers
6. Counters
7. Errors
1. Introduction
The HTTP module is one of the more complex fetcher modules. It performs many
different operations required by the HTTP specifications and attempts to
fulfil them reliably and in a manner which is non-destructive to remote
servers. This document details *some* of the implementation decisions which
somebody maintaining this module may need to be aware of.
HTTP fetches are started with a call to SWI HTTP_GetData (via URL_GetURL).
Then the client application repeatedly calls ReadData and Status (again
via the URL module) until all the data is returned, and then HTTP_Stop (via
URL_DeregisterURL or URL_Stop) to clean up.
This is the standard fetcher interface. However, many things are done
internally to optimise use of resources, and these are described below.
2. "Session.headers"
The Session data structure contains a "headers" member which is used for
accumulating HTTP headers. This is used for BOTH the request headers and
the response headers.
2.1 During Requests
When the request is being formulated (during HTTP_GetData), headers are added
to the list with "http_add_header". This function adds headers to a header
list (the function itself is a generic list builder). The header list can be
thought of as an associative array (duplicate indices allowed) which is kept
in the order in which items were added. Routines are provided for locating a
header, deleting a header, and deleting all the headers. Once all the standard
headers have been added, attention turns to the client-supplied data (pointer
in R4 on entry). These are also parsed and any entity-body in the client data
is then duplicated into Session.data (length in Session.data_len).
Once the list is complete, the headers are all examined and inappropriate
ones are removed. If the method is an idempotent read (GET or HEAD), any
entity body is discarded completely and related headers are also ditched
(Content-Length, Content-Type, Transfer-Encoding). This can help clients
which "forget" to zero R4 (eg. !Browse (as of October 1997) on requests
generated by a redirect response to a POST).
If the method was a POST or PUT, then a content-length header is sought, and
is added if it was missing. If it was already there but indicated an
amount of data exceeding Session.data_len, it is rewritten to match
Session.data_len.
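The two rules above can be sketched roughly as follows. This is an
illustration only, not the module's own code: the function names are
hypothetical, "data_len" stands in for Session.data_len, and a declared
length smaller than the data is left alone because the text above does not
say what happens in that case.

```c
#include <string.h>

/* Hypothetical helper: choose the Content-Length value that a POST or PUT
 * request should carry. `declared` is the client's value, or -1 if the
 * client supplied no Content-Length header at all. */
static long fix_content_length(long declared, long data_len)
{
    if (declared < 0)
        return data_len;    /* header was missing: add it */
    if (declared > data_len)
        return data_len;    /* claims more data than we hold: rewrite */
    return declared;        /* assumption: otherwise keep the client's value */
}

/* Hypothetical helper: GET and HEAD requests have any entity body (and the
 * headers describing it) discarded before the request is sent. */
static int method_discards_body(const char *method)
{
    return strcmp(method, "GET") == 0 || strcmp(method, "HEAD") == 0;
}
```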
Finally, a buffer is created, filled with the complete request,
and stored in Session.fullreq. The act of pushing everything into the buffer
also EMPTIES the Session.headers list. The first "header" line is separated
from its "value" by a space, subsequent lines use a colon. The first
"header" is actually the HTTP method being used (eg. "GET" or "POST") and the
value is the URI plus a space plus the HTTP version token - hence the reason
for the special treatment of the first line.
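The serialisation step might look something like the sketch below. The
"struct header" type and "build_request" function are hypothetical names
invented for this illustration, and a plain array stands in for the header
list. Unlike the real module, the sketch does not empty the list as it goes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one entry in the Session.headers list. */
struct header {
    const char *name;
    const char *value;
};

/* Write the request into `out`. The first entry (method, then URI and
 * version) is joined with a space; every later entry with a colon.
 * Returns the number of bytes written, or -1 if `out` is too small. */
static int build_request(const struct header *h, size_t count,
                         char *out, size_t out_size)
{
    size_t used = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        const char *sep = (i == 0) ? " " : ": ";
        int n = snprintf(out + used, out_size - used, "%s%s%s\r\n",
                         h[i].name, sep, h[i].value);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;                  /* output would be truncated */
        used += (size_t)n;
    }
    if (out_size - used < 3)
        return -1;
    memcpy(out + used, "\r\n", 3);      /* blank line ends the headers */
    return (int)(used + 2);
}
```

Treating the request line as just another list entry keeps the list builder
generic; only the separator differs when the buffer is filled.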
2.2 During Responses
A similar process is followed when reading the response. Whilst the headers
are still being retrieved (that is, Session.donehead is zero, or
Session.chunking is non-zero and Session.chunkstate indicates that headers or
footers are being read), a COPY of the data ready on an incoming socket is
requested (recv is called with MSG_PEEK in the fourth parameter) and written
into an internally allocated buffer.
Headers are parsed from this just like they were from the client data on a
request - it uses the same routine. Content-length is trapped and parsed,
the value stored and the header then DISCARDED.
Once all the headers are known, then they are filtered as appropriate.
Content-length is ignored if there was a transfer-encoding. The HTTP version
is reset to 1.0 for the benefit of the client. Transfer-Encoding is parsed
to look for the string "chunked" which sets Session.chunking to non-zero and
the header is then removed. Connection headers are removed and any headers
matching tokens in the connection header are also removed (they do not
concern our client, as they refer solely to characteristics of the connection
between this module and the remote server). Incoming headers are sanitised
(continuation lines are rejoined, and leading/trailing spaces are removed and
generally tidied up). Cookies, if compiled in (need -DCOOKIE in Makefile),
are then parsed and stored if appropriate. However, they are not removed
from the header list.
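The Connection header filtering described above could be sketched as below.
The function names are hypothetical and the tokeniser is deliberately simple
(it splits on commas and whitespace only); it is not taken from the module.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the token tok[0..len) with `name`. */
static int token_equals(const char *tok, size_t len, const char *name)
{
    size_t i;

    if (strlen(name) != len)
        return 0;
    for (i = 0; i < len; i++) {
        if (tolower((unsigned char)tok[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return 1;
}

/* Returns non-zero if the header called `name` should be hidden from the
 * client: either it is the Connection header itself, or it is named by one
 * of the tokens in the Connection header's value. */
static int is_connection_specific(const char *connection_value,
                                  const char *name)
{
    const char *p = connection_value;

    if (token_equals("Connection", 10, name))
        return 1;
    while (*p != '\0') {
        const char *start;
        size_t len;

        while (*p == ',' || *p == ' ' || *p == '\t')
            p++;                        /* skip separators */
        start = p;
        while (*p != '\0' && *p != ',' && *p != ' ' && *p != '\t')
            p++;                        /* scan one token */
        len = (size_t)(p - start);
        if (len > 0 && token_equals(start, len, name))
            return 1;
    }
    return 0;
}
```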
After all of that, if not chunking, the content-length header is added back
to the response headers but at the end of the header list, and a buffer is
generated containing the complete response header. Thus not all of the
headers will be seen by the client, and those that are present are not
necessarily in the same order that the server sent them. This should not
concern the client.
As each header line is read from the buffer (including reconstructing
split lines), the data representing that header is read and discarded from
the TCP/IP stack (recv is called again but without MSG_PEEK). (Actually, the
header parser keeps a running total of how much data has been "consumed", and
only discards the data in one go when it has processed as many headers as it
could.)
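The peek-then-discard pattern can be illustrated with the sketch below,
which assumes a BSD-style sockets API. The function name is hypothetical,
and the sketch treats every newline as the end of a header line; it makes no
attempt to rejoin continuation lines as the module does.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Peek at whatever is queued on `sock`, work out how many bytes make up
 * complete lines, and only then remove exactly that much from the stack.
 * Any partial line stays queued for the next call. Returns the number of
 * bytes consumed, 0 if no complete line was available (or the peer closed
 * the connection), or -1 on error. */
static long consume_complete_lines(int sock, char *buf, size_t buf_size)
{
    ssize_t peeked = recv(sock, buf, buf_size, MSG_PEEK);
    size_t consumed = 0;
    ssize_t i;

    if (peeked < 0)
        return -1;
    for (i = 0; i < peeked; i++) {
        if (buf[i] == '\n')
            consumed = (size_t)i + 1;   /* end of the last complete line */
    }
    if (consumed > 0) {
        /* Second recv, without MSG_PEEK, discards the processed bytes. */
        if (recv(sock, buf, consumed, 0) != (ssize_t)consumed)
            return -1;
    }
    return (long)consumed;
}
```

Discarding in one go after parsing means a header split across two network
reads is never half-removed from the socket.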
3. Reading data
Headers are read into a private buffer (Session.buffer). When reading body
data, data is read directly into the client's buffer. The function
"http_write_data_to_client" must always be called in order to update R2, R3 and
R4 when client data has been generated and that function copies data into the
client buffer only if it needs to.
R2, R3 and R4 are updated so that future calls to write more data are
appended to the buffer (if they occur during the execution of the same SWI
call, of course). This can be done because the wrapper function
http_readdata enforces the spec requirement that R2 and R3 be preserved
across the call. Currently, the only time when data will be written to the