SAS Help and Documentation - Uncharted Waters

by Richard A. DeVenezia

Download and run the following Perl script to take a look under the hood of your SAS help files. A fast machine with lots of RAM is recommended.

Download the script: install-backlinks.pl.txt

Q: What the heck is this thing for?
A: Getting more out of your SAS Documentation.

Q: What does it do?
A: Analyzes SAS help files. Writes an index showing orphans, dead links and duplicate titled pages. Installs referer back links in each page of documentation.

Q: Why?
A:

The impetus was the lack of adequate navigation in SAS/AF help pages. My searches often end at an attribute or method page. The page is displayed and has adequate content, but there are no navigation links to the class whose attribute or method I am looking at! Furthermore, the Locate button is unable to place the page in the table of contents. Ouch!!

Another annoyance occurs when I find myself reading some 'page' of information at two different times. The 'page' is actually two different pages that mostly differ only subtly (i.e. just the padding or margins are different). Other times the content is slightly different. So, for a given page title (or <H1>) I might have to deal with two different pages. I'm not sure why this is so, but it is aggravating.


After running the script you could end up with several tens of thousands of files. There is a good reason the help files were in chm form in the first place. I have not tried to recompile the html files back into chm files (at which point they would go 'native' with your SAS installation). A more detailed analysis of content and titles would have to be performed before attempting to insert new content-based links or jump menus.

While I am on the subject of improved help files, SAS should investigate and promote a user-community-managed documentation scheme (at a minimum, something similar to the user-contributed notes one might see at http://www.php.net/manual/en/). I would not recommend using the term siki for a SAS contented wiki.


From the script header

# Richard A. DeVenezia
# October 6, 2003
# http://www.devenezia.com
# Improve SAS Online help - tested with version 8 and 9
# Browse extractDir\index.html after running the script

my %param = ( modules    => "common af fsp"
            , extractDir => "c:\\temp\\sas"
            , noiseLevel => 9
            , pageLimit  => 0
            );

# This perl script was tested on a Windows 2000 machine
# perl for windows can be downloaded from http://www.activestate.com
# The source was edited using UltraEdit found at http://www.ultraedit.com
#
# I agree this script may contain stupid perl
#
# Runtime parameters
# -----
# modules    - space separated list of SAS help chm modules,
#              full list can be seen at !SASROOT\core\help
# extractDir - local path where chm help modules get decompiled
# noiseLevel - higher means more messages
# pageLimit  - 0 means process all files, otherwise process only
#              first N files of each module (only use N>0 when testing)
#
# module common is needed for style sheets
# the program adds an A{} block to make links more visible
#
#-------------------------------------
# What does the script do ?
#-------------------------------------
#
# Modify html files extracted from SAS chm files:
# - color links that refer to pages that refer back
# - link to __ALL__ pages that are referers
# - list dead links found on a page
# - indicate if the page is an orphan
# Generate an index that:
# - links to table of contents
# - links to list of keywords
# - lists orphan pages
# - lists pages with dead links
# - lists pages with duplicate titles
#
# Requires:
#   Access to registry to determine SAS installation location
#   Microsoft Html Help (hh.exe) so that .chm files can be decompiled.
# Following a link into an existing chm requires:
#   Internet Explorer with JavaScript enabled
#
#-------------------------------------
# Background
#-------------------------------------
#
# The SAS Online documentation is quite complete and very informative.
# That does not mean it cannot be improved.  One area I find needing
# improvement is back links.  Often a keyword search will take me to a
# page that does not have information to allow it to be 'located' in
# the contents tree when the location button is pressed.  Nor is there
# a link to another page having 'parent' or 'aggregating' context.
#
# This is especially troublesome for AF programmers whose search places
# them at a methods or attributes page.  These pages do not have back
# links to the class containing the method (ouch!).
#
# I would prefer each page have a link to _every_ page that links to it.
# Doing so provides a much richer information net and lets me get a
# taste of the oosphere.
#
# So, I am addressing the situation
# A ---> B    ( A is a referer of B )
# by altering B so the relations are
# A <--> B    ( force B to be a referer of A )
#
# An even better improvement (not being done by this program) would be to
# ensure the forced back link goes to the point in A where B is first referred to
# I.e. in Page P: point of arrival from B = first point of departure to B
#
# A more difficult yet equally useful navigation change would be to
# enable some form of horizontal traversal. (Some sections of SAS help do
# exhibit this feature.)
#
# Consider:
#
#     A      level 1      A
#    /|\                 /|\
#   / | \               / | \
#  B  C  D   level 2   B--C--D
#
# I prefer all nodes on level 2 have links to every other node on level 2.
# At a minimum each node should provide a previous and next.  In terms of
# SAS/AF, it would mean when you are looking at a method page, you are one
# or two clicks away from another class's method or attribute page.
# Anyway, that is for a later day...
#
# I studied the html files decompiled out of af and fsp chm and found
# several things:
#
# 1. very good consistency
# 2. consistency means simplistic pattern matching and replacement can
#    be used to extract information and manipulate the html files to my
#    own purposes.
#
#-------------------------------------
# What are the patterns ?
#-------------------------------------
#
# All link navigation is of form <A HREF="destination">information</A>.
# destination is of form MS-ITS:<module>.chm::/<module>.hlp/<some-destination>.
# The decompiled html files are placed in a <module>.hlp subfolder.
# Image SRC refer to a root absolute /<module>.hlp/images/ instead of
# relative ../<module>.hlp/images
#
# With such good consistency we can
#
# 1. make changes to HREFs so online help works in decompiled form
#    1a. some advanced mojo is used to change links to modules _not_ decompiled.
#        the links are changed to cause htmlhelp to open when the link is clicked.
#        the mojo only works in Internet Explorer browser.
# 2. determine incoming and outgoing links of each page for processing
#    I.E.
#    - if a page P has links incoming from A,B,C,X,Y and has outgoing links to B,C,X,Y,Z
#      I want to add to page P outgoing links to A and colorize the outgoing links
#      B,C,X,Y
#
#-------------------------------------
# How is the link data processed ?
#-------------------------------------
#
# Regular Expressions and Hashes!!!
#
# Each file in the <module>.hlp folders is scanned and information extracted.
# At the same time, 'fixes' are made to the links so they work in decompiled form.
#
# There will be three conceptual hashes maintained
# - pages - hash for page data, each file scanned has 'page data'
#   o page data - an array
#     - incoming, hash for page referers
#     - outgoing, hash for page href destinations
#     - title of page
#
# The data requires two passes
# pass 1. fix necessary links and record linkages
# pass 2. analyze linkages and update pages if necessary
#
# Once the data is in hashes, it is a relatively simple matter to
# perform all the interesting set analysis we want.
#
#-------------------------------------
# How big is this stuff ?
#-------------------------------------
# common, af and fsp ends up with ~7,500 files (36 MB)
# and requires about 100 seconds to process when run on a
# Windows 2000 / Intel 3.06 GHz / 1 GB RAM / ATA-100 system
#
# I have not tried recompiling the modified html back into
# chm files.
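
The two-pass idea described in the header can be illustrated with a small, self-contained sketch. This is not the author's script: the %html sample data, the fix_href helper, and the variable names are all hypothetical, and the rewrite of MS-ITS links to ../<module>.hlp/ relative paths is an assumption based on the patterns listed above. Fragments (#anchor) and links into modules that were not decompiled are ignored here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for decompiled pages (path => html text).
# A real run would read the files found in each <module>.hlp folder.
my %html = (
    'af.hlp/class.htm'   => '<H1>Class</H1>'
                          . '<A HREF="MS-ITS:af.chm::/af.hlp/method.htm">method</A>'
                          . '<A HREF="MS-ITS:af.chm::/af.hlp/gone.htm">gone</A>',
    'af.hlp/method.htm'  => '<H1>Method</H1>',
    'af.hlp/lonely.htm'  => '<H1>Lonely</H1>'
                          . '<A HREF="MS-ITS:af.chm::/af.hlp/class.htm">class</A>',
    'fsp.hlp/method.htm' => '<H1>Method</H1>',
);

my %pages;    # path => { title => ..., incoming => {...}, outgoing => {...} }

# Record one outgoing link and return the HREF rewritten as a relative path
# (assumed form: ../<module>.hlp/<some-destination>).
sub fix_href {
    my ($from, $dir, $file) = @_;
    $pages{$from}{outgoing}{"$dir/$file"} = 1;
    return qq{HREF="../$dir/$file"};
}

# Pass 1: fix links so they work in decompiled form and record linkages.
for my $path (sort keys %html) {
    my ($title) = $html{$path} =~ m{<H1>(.*?)</H1>}is;
    $title = '(untitled)' unless defined $title;
    $pages{$path} = { title => $title, incoming => {}, outgoing => {} };
    $html{$path} =~ s{HREF="MS-ITS:\w+\.chm::/(\w+\.hlp)/([^"]+)"}
                     {fix_href($path, $1, $2)}gei;
}

# Pass 2: analyze linkages.
my (%dead, %backlinks, @orphans, %by_title);
for my $path (sort keys %pages) {
    for my $dest (sort keys %{ $pages{$path}{outgoing} }) {
        if (exists $pages{$dest}) { $pages{$dest}{incoming}{$path} = 1 }
        else                      { $dead{$path}{$dest} = 1 }   # target was never scanned
    }
}
for my $path (sort keys %pages) {
    my $page = $pages{$path};
    push @orphans, $path unless %{ $page->{incoming} };         # nobody links here
    # Back links to add: referers this page does not already link to.
    $backlinks{$path} = [ grep { !$page->{outgoing}{$_} }
                          sort keys %{ $page->{incoming} } ];
    push @{ $by_title{ $page->{title} } }, $path;
}
my %dup = map  { $_ => $by_title{$_} }
          grep { @{ $by_title{$_} } > 1 } keys %by_title;

print "orphans: @orphans\n";
for my $path (sort keys %backlinks) {
    next unless @{ $backlinks{$path} };
    print "$path needs back links to: @{ $backlinks{$path} }\n";
}
for my $path (sort keys %dead) {
    print "$path has dead links: ", join(', ', sort keys %{ $dead{$path} }), "\n";
}
print "duplicate title '$_': @{ $dup{$_} }\n" for sort keys %dup;
```

Keeping incoming and outgoing as hashes rather than lists makes the set operations (referers not yet linked to, destinations never scanned) one-line lookups, which is presumably why the header calls the analysis "relatively simple" once the data is in hashes.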