Paperboy: News page HTML source notes
~~~~~~~~  v1.00 (28-Oct-1999)
 
 
WIRED NEWS
==========
Root Index page (http://www.wired.com/news/)
----------
Index page contains links to sections[1]: "Business", "Culture",
"Technology", and "Politics". The URL for each of these sections is
'http://www.wired.com/news/'.lc(section).'/xxxxxx.html'. The final component
(xxxxxx.html) appears to vary, so it will have to be read from the index page.
The front page only reproduces stories from these subsections.
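
A minimal sketch of the section URL handling, in Java (assumed here to be
Paperboy's language). The class and method names are hypothetical. Because the
xxxxxx.html part varies, only the directory prefix is derived; the full URL is
searched for in the index page HTML. It assumes hrefs are double-quoted.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative, not part of Paperboy.
final class WiredSections {
    static final String ROOT = "http://www.wired.com/news/";

    // Directory prefix for a section, e.g. "Business" -> ".../news/business/".
    static String sectionPrefix(String section) {
        return ROOT + section.toLowerCase(Locale.ROOT) + "/";
    }

    // Returns the first href in the index HTML that points into the section
    // (absolute or server-relative), or null if none is found.
    static String findSectionUrl(String indexHtml, String section) {
        String path = "/news/" + section.toLowerCase(Locale.ROOT) + "/";
        Pattern p = Pattern.compile(
            "href=\"((?:http://www\\.wired\\.com)?" + Pattern.quote(path)
                + "[^\"]*\\.html)\"",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(indexHtml);
        return m.find() ? m.group(1) : null;
    }
}
```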

Section Index page (eg. http://www.wired.com/news/business/....)
-------------
Contains headlines/summaries between "<!-- {start,end} content -->", with
each headline inside an anchor tag pair (<a ...>...</a>). The next
<font>...</font> pair can be ignored (it contains the time). The contents of
the <font>...</font> pair after that are the summary. Skip forward to the next
<a ...>...</a> for the next story.

Headlines and/or summaries can contain HTML tags.
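
The walk through the section index could be sketched as below. This is only
an illustration under assumptions: the class name is made up, hrefs are taken
to be double-quoted, <font> pairs are taken not to nest, and embedded tags are
simply stripped from headline and summary (the notes only say they can occur).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for a Wired News section index page.
final class WiredSectionIndex {
    private static final Pattern STORY = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>"  // headline anchor
        + ".*?<font[^>]*>.*?</font>"                   // time: ignored
        + ".*?<font[^>]*>(.*?)</font>",                // summary
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one { url, headline, summary } array per story.
    static List<String[]> parse(String html) {
        List<String[]> stories = new ArrayList<String[]>();
        int start = html.indexOf("<!-- start content -->");
        int end = html.indexOf("<!-- end content -->");
        if (start < 0 || end < start) return stories;
        Matcher m = STORY.matcher(html.substring(start, end));
        while (m.find()) {
            stories.add(new String[] {
                m.group(1), strip(m.group(2)), strip(m.group(3)) });
        }
        return stories;
    }

    // Crude tag removal; good enough for short headline/summary text.
    static String strip(String s) {
        return s.replaceAll("<[^>]*>", "").trim();
    }
}
```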

"Fancy" Story page
-------------
The stories are split up in this version, so the plain versions should be
used. However, this presents a problem, as the URLs of the plain versions do
*not* look like they can be derived. This page must instead be searched for the
"printing" hyperlink (the href containing the string "/print/").
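
A small sketch of that search, assuming double-quoted hrefs. The class name is
hypothetical, and nothing is assumed about the URL beyond it containing
"/print/".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: locate the "printing" link on a fancy story page.
final class PrintLink {
    private static final Pattern HREF = Pattern.compile(
        "href=\"([^\"]*/print/[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the first href containing "/print/", or null if there is none.
    static String find(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```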

Plain Story page
-----------
Story text between "<!-- STORY -->" and "<!-- END STORY -->". Story proper
begins at first "<font ...>" tag (before that is meta content). Remove text
between "<P><CENTER><HR NOSHADE>" and "<HR NOSHADE></CENTER>".

Image "http://static.wired.com/news/images/pix155.gif" is referenced towards
the end of the story and should be removed (or changed to a locally cached
version).

Related links only need prefixing with the server name. There are dodgy
<pagebreak> tags which should be removed; otherwise Java (by default) throws
an exception whilst trying to display the page.
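
The clean-up steps for the plain page might be combined as in the sketch
below. The class name is hypothetical, and several things are assumed rather
than known: that the <HR NOSHADE> block is removed together with its
delimiters, that the pix155.gif image is dropped rather than cached, and that
related links are double-quoted hrefs starting with "/".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical clean-up of a Wired News plain ("print") story page.
final class WiredPlainStory {
    private static final String START = "<!-- STORY -->";
    private static final String END = "<!-- END STORY -->";
    private static final Pattern FONT =
        Pattern.compile("<font", Pattern.CASE_INSENSITIVE);

    static String clean(String html) {
        int start = html.indexOf(START);
        int end = html.indexOf(END);
        if (start < 0 || end < start) return "";
        String story = html.substring(start + START.length(), end);

        // Story proper begins at the first <font ...> tag; drop the meta part.
        Matcher font = FONT.matcher(story);
        if (font.find()) story = story.substring(font.start());

        // Remove the block between the two <HR NOSHADE> markers (inclusive).
        story = story.replaceAll(
            "(?is)<P><CENTER><HR NOSHADE>.*?<HR NOSHADE></CENTER>", "");
        // Drop the pix155.gif image and the non-standard <pagebreak> tags.
        story = story.replaceAll("(?i)<img[^>]*pix155\\.gif[^>]*>", "");
        story = story.replaceAll("(?i)</?pagebreak[^>]*>", "");
        // Prefix server-relative links with the server name.
        return story.replaceAll("(?i)href=\"/", "href=\"http://www.wired.com/");
    }
}
```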


THE REGISTER
============
Root Index page (http://www.theregister.co.uk/morenews.html)
----------
Contains links to stories in between "<TD WIDTH=590 VALIGN=TOP>" and "</TD>".
Headlines are only hyperlinks in this section - no summaries available.

Story page (http://www.theregister.co.uk/yymmdd-xxxxxx.html)
-----
Story is between the same tags as above and, within that block, between "<p>"
and "<a href="/">".
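
Both Register pages could be handled along these lines. This is a sketch with
hypothetical names; it assumes the tags appear exactly as quoted above
(including case), that the main cell contains no nested table cells, and that
hrefs are double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers for The Register's index and story pages.
final class RegisterPages {
    private static final String CELL_START = "<TD WIDTH=590 VALIGN=TOP>";
    private static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Contents of the main table cell, or "" if it cannot be found.
    static String mainCell(String html) {
        int s = html.indexOf(CELL_START);
        if (s < 0) return "";
        s += CELL_START.length();
        int e = html.indexOf("</TD>", s);
        return e < 0 ? "" : html.substring(s, e);
    }

    // Index page: one { url, headline } array per hyperlink in the main cell.
    static List<String[]> headlines(String indexHtml) {
        List<String[]> result = new ArrayList<String[]>();
        Matcher m = LINK.matcher(mainCell(indexHtml));
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return result;
    }

    // Story page: text from the first <p> up to the <a href="/"> link.
    static String storyBody(String storyHtml) {
        String cell = mainCell(storyHtml);
        int s = cell.indexOf("<p>");
        if (s < 0) return "";
        int e = cell.indexOf("<a href=\"/\">", s);
        return e < 0 ? "" : cell.substring(s, e);
    }
}
```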


BBC ONLINE
==========
Root Index page (http://news.bbc.co.uk/text_only.htm)
----------
No need to decode this as each of the sections can be accessed directly.
However, the site says it's currently undergoing a "redesign". The format below
may change...

Section Index page (http://news.bbc.co.uk/low/english/xxx/default.htm,
-------------           where xxx = { world, uk, uk_politics, business,
                                      sci/tech, health, education, sport,
                                      talking_point } )

Headlines/summaries are between the 2nd "<HR>" and "<TABLE ...>". The
headline/URL is <B><A href="URL from /">...</A></B>. The summary is then
between the next <BR> and </P>. HTML tags (comments, extraneous <p> etc.)
should be stripped from both headline and summary.
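
A sketch of the BBC section index parse follows. The names are hypothetical,
hrefs are assumed to be double-quoted, and the page is treated as unparseable
(empty result) if the second <HR> or the <TABLE ...> is missing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for a BBC low-graphics section index page.
final class BbcSectionIndex {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern HR = Pattern.compile("<HR[^>]*>", FLAGS);
    private static final Pattern TABLE = Pattern.compile("<TABLE[\\s>]", FLAGS);
    private static final Pattern STORY = Pattern.compile(
        "<B>\\s*<A\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</A>\\s*</B>"  // headline
        + ".*?<BR>(.*?)</P>",                                         // summary
        FLAGS);

    // Returns one { url, headline, summary } array per story.
    static List<String[]> parse(String html) {
        List<String[]> stories = new ArrayList<String[]>();
        Matcher hr = HR.matcher(html);
        if (!hr.find() || !hr.find()) return stories;  // need the 2nd <HR>
        int start = hr.end();
        Matcher table = TABLE.matcher(html);
        if (!table.find(start)) return stories;
        Matcher m = STORY.matcher(html.substring(start, table.start()));
        while (m.find()) {
            stories.add(new String[] {
                m.group(1), strip(m.group(2)), strip(m.group(3)) });
        }
        return stories;
    }

    // Remove comments first (they may contain '>'), then any other tags.
    static String strip(String s) {
        return s.replaceAll("(?s)<!--.*?-->", "")
                .replaceAll("<[^>]*>", "").trim();
    }
}
```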

Story page (http://news.bbc.co.uk/low/english/newsid.....stm)
-----
Story content is between the second <HR> and the <HR> after that. Linked
images will need their URLs rewritten, with the images stored under some form
of hash code in an images directory (perhaps). Prefetch some images (eg.
/furniture/nothing.gif), and if an image is already present don't refetch it.
Mmmm, cachy.
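
One way the image cache naming might look, as a sketch only: the class is
hypothetical, and using String.hashCode() for the file name is an assumption
(it is only 32 bits, so collisions are possible; a longer digest could be
swapped in).

```java
import java.io.File;

// Hypothetical image cache: maps image URLs to files in a local directory.
final class ImageCache {
    private final File dir;

    ImageCache(File dir) {
        this.dir = dir;
    }

    // Local file for an image URL: hex hash code plus the original extension.
    File localFile(String url) {
        int dot = url.lastIndexOf('.');
        int slash = url.lastIndexOf('/');
        String ext = dot > slash ? url.substring(dot) : "";
        return new File(dir, Integer.toHexString(url.hashCode()) + ext);
    }

    // True if the image is not yet in the cache and should be fetched.
    boolean needsFetch(String url) {
        return !localFile(url).exists();
    }
}
```

Rewriting an <img> tag then amounts to replacing its src with the path of
localFile(url), after fetching the image only when needsFetch(url) is true.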


PROCESSING API
==============

A class which has the concept of a "current location" allowing the HTML file
to be read through in blocks. Some of its methods would be:

 * Get URL which matches a String
 * Find/crop to text between tag 1 & tag 2[*], either inclusively or exclusively
 * Strip all HTML tags
 * Remove text between two tags from middle of block
 * Remove a String from the middle of block, either globally or just next
 * Reset pointer to start of block
 * Move pointer forward to next occurrence of a String
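
The list above might translate into a class along the following lines. This
is a rough sketch, not a settled design: every name is hypothetical, tags are
matched as plain case-sensitive strings, hrefs are assumed to be double-quoted,
and the pointer is reset whenever text before it may have changed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the processing class; all names are illustrative.
final class HtmlBlock {
    private String text;  // the current block
    private int pos;      // "current location" within the block

    HtmlBlock(String html) { this.text = html; }

    // Reset pointer to start of block.
    void reset() { pos = 0; }

    // Move pointer forward to the next occurrence of s; false if not found.
    boolean moveTo(String s) {
        int i = text.indexOf(s, pos);
        if (i < 0) return false;
        pos = i;
        return true;
    }

    // First href at or after the pointer that contains 'part', or null.
    String findUrl(String part) {
        Matcher m = Pattern.compile(
            "href=\"([^\"]*" + Pattern.quote(part) + "[^\"]*)\"",
            Pattern.CASE_INSENSITIVE).matcher(text);
        return m.find(pos) ? m.group(1) : null;
    }

    // Crop the block to the text between tag1 and tag2, searching from the
    // pointer; 'inclusive' keeps the tags themselves. Resets the pointer.
    boolean crop(String tag1, String tag2, boolean inclusive) {
        int s = text.indexOf(tag1, pos);
        if (s < 0) return false;
        int e = text.indexOf(tag2, s + tag1.length());
        if (e < 0) return false;
        text = inclusive ? text.substring(s, e + tag2.length())
                         : text.substring(s + tag1.length(), e);
        pos = 0;
        return true;
    }

    // Strip all HTML tags. Resets the pointer.
    void stripTags() {
        text = text.replaceAll("<[^>]*>", "");
        pos = 0;
    }

    // Remove the next tag1...tag2 span (tags included) from the block.
    boolean removeBetween(String tag1, String tag2) {
        int s = text.indexOf(tag1, pos);
        if (s < 0) return false;
        int e = text.indexOf(tag2, s + tag1.length());
        if (e < 0) return false;
        text = text.substring(0, s) + text.substring(e + tag2.length());
        return true;
    }

    // Remove s everywhere (global, pointer reset) or just its next occurrence.
    void remove(String s, boolean global) {
        if (global) {
            text = text.replace(s, "");
            pos = 0;
        } else {
            int i = text.indexOf(s, pos);
            if (i >= 0) text = text.substring(0, i) + text.substring(i + s.length());
        }
    }

    @Override
    public String toString() { return text; }
}
```

For example, the Wired plain page would start with
crop("<!-- STORY -->", "<!-- END STORY -->", false), followed by
remove("<pagebreak>", true).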