Paperboy: News page HTML source notes
~~~~~~~~
v1.00 (28-Oct-1999)


WIRED NEWS
==========

Root Index page (http://www.wired.com/news/)
----------

Index page contains links to sections[1]: "Business", "Culture",
"Technology", and "Politics". The URL for each of these sections is
'http://www.wired.com/news/'.lc(section).'/xxxxxx.html'. The final
part (xxxxxx) appears variable and so will have to be sourced from
the index page. The front page only reproduces stories from these
subsections.

Section Index page (eg. http://www.wired.com/news/business/....)
-------------

Contains headlines/summaries between "" with headlines in between the
anchor tags. Next ... pair can be ignored (contains time). Contents of
... after that contain the summary. Skip forward to next ... for next
story. Headlines and/or summaries can contain HTML tags.

"Fancy" Story page
-------------

The stories are split in this version and so the plain versions should
be used. However, this presents a problem, as the URLs to the plain
version do *not* look like they can be derived. This page must be
searched for the appropriate "printing" hyperlink (this is the href
containing the string "/print/").

Plain Story page
-----------

Story text between "" and "". Story proper begins at first "" tag
(before that is meta content). Remove text between "" and "". Image
"http://static.wired.com/news/images/pix155.gif" is referenced towards
the end of the story and should be removed (or changed to a locally
cached version). Related links only require prefixing with server
name. There are dodgy tags which should be removed, otherwise Java (by
default) generates an exception whilst trying to display the page.


THE REGISTER
============

Root Index page (http://www.theregister.co.uk/morenews.html)
----------

Contains links to stories in between "" and "". Headlines are only
hyperlinks in this section - no summaries available.

Story page (http://www.theregister.co.uk/yymmdd-xxxxxx.html)
-----

Story is between the same tags as above, and then within "" and ""
within that block.


BBC ONLINE
==========

Root Index page (http://news.bbc.co.uk/text_only.htm)
----------

No need to decode this as each of the sections can be accessed
directly. However, the site says it's currently undergoing a
"redesign". The format below may change...

Section Index page (http://news.bbc.co.uk/low/english/xxx/default.htm)
-------------

where xxx = { world, uk, uk_politics, business, sci/tech, health,
              education, sport, talking_point }

Headlines/summaries between 2nd "" and "". Headline/URL is ....
Summary is either then between next "" and "". HTML tags (comments,
extraneous "" etc.) should be stripped from both headline and summary.

Story page (http://news.bbc.co.uk/low/english/newsid.....stm)
-----

Story content is between second "" and "" after that. Linked images
will need their URLs rewriting, and the images storing under some form
of hash-code in an images directory (perhaps). Prefetch some images
(eg. /furniture/nothing.gif) and if an image is already present don't
refetch. Mmmm, cachy.


PROCESSING API
==============

A class which has the concept of a "current location" allowing the
HTML file to be read through in blocks. Some of its methods would be:

* Get URL which matches a String
* Find/crop to text between tag 1 & tag 2[*], either inclusively or
  exclusively
* Strip all HTML tags
* Remove text between two tags from middle of block
* Remove a String from the middle of block, either globally or just
  next
* Reset pointer to start of block
* Move pointer forward to next occurrence of a String
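
The method list above could be sketched roughly as follows. This is
only an illustration under stated assumptions, not the Paperboy
implementation: the class name HtmlBlock, every method name, and the
sample markup are hypothetical, and Python is used purely for brevity.
It also assumes that "tags" are matched as literal strings, that
removing text between two tags removes the tags too, and that a naive
regular expression is good enough for stripping tags.

```python
import re


class HtmlBlock:
    """A block of HTML text plus a 'current location' pointer (hypothetical)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def reset(self):
        """Reset pointer to start of block."""
        self.pos = 0

    def seek(self, needle):
        """Move pointer forward to next occurrence of needle; False if absent."""
        i = self.text.find(needle, self.pos)
        if i < 0:
            return False
        self.pos = i
        return True

    def find_url(self, fragment):
        """Return first href value at/after the pointer containing fragment."""
        pattern = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)
        for match in pattern.finditer(self.text, self.pos):
            if fragment in match.group(1):
                return match.group(1)
        return None

    def _span(self, start, end):
        """Locate start..end at/after the pointer; returns (s, e) or None."""
        s = self.text.find(start, self.pos)
        if s < 0:
            return None
        e = self.text.find(end, s + len(start))
        if e < 0:
            return None
        return s, e

    def crop_between(self, start, end, inclusive=False):
        """Crop the block to the text between two markers."""
        span = self._span(start, end)
        if span is None:
            return False
        s, e = span
        if inclusive:
            self.text = self.text[s:e + len(end)]
        else:
            self.text = self.text[s + len(start):e]
        self.pos = 0
        return True

    def remove_between(self, start, end):
        """Remove the next start..end span (markers included: an assumption)."""
        span = self._span(start, end)
        if span is None:
            return False
        s, e = span
        self.text = self.text[:s] + self.text[e + len(end):]
        return True

    def remove_string(self, needle, everywhere=False):
        """Remove needle globally, or just its next occurrence after the pointer."""
        if everywhere:
            self.text = self.text.replace(needle, "")
        else:
            head, tail = self.text[:self.pos], self.text[self.pos:]
            self.text = head + tail.replace(needle, "", 1)
        self.pos = min(self.pos, len(self.text))

    def strip_tags(self):
        """Strip everything that looks like <...> (naive, comments included)."""
        self.text = re.sub(r"<[^>]*>", "", self.text)
        self.pos = min(self.pos, len(self.text))


# Usage on made-up markup (the comment markers are not from any real site).
page = HtmlBlock(
    '<html><body><!-- start -->'
    '<a href="/news/print/story.html">Print</a>'
    '<p>Hello <b>world</b></p><!-- end --></body></html>'
)
print_url = page.find_url("/print/")
page.crop_between("<!-- start -->", "<!-- end -->")
page.remove_between("<a ", "</a>")
page.strip_tags()
print(print_url)   # /news/print/story.html
print(page.text)   # Hello world
```

Matching on literal strings rather than parsing the HTML keeps the
sketch tolerant of the "dodgy" markup mentioned for Wired News, at the
cost of breaking whenever a site changes its layout.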