
<HTML>
<HEAD>
   <TITLE>Running Sitescooper</TITLE>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#3300CC" VLINK="#660066">

<H1>Running Sitescooper</H1>

<ul>
<li><a href=#thecommand>Running the Command Itself</a></li>
<li><a href=#sitesdir>The Sites Directory</a></li>
<li><a href=#sitesoncmdline>Selecting Sites On The Command Line</a></li>
<li><a href=#scoopaurl>Scooping a URL Without a Site File</a></li>
<li><a href=#nottooclever>Stopping Sitescooper From Being Too Clever</a></li>
<li><a href=#nottooclever2>Stopping Sitescooper From Being Too Clever, pt. 2</a></li>
<li><a href=#filesizelimit>What's This "File Size Limit Exceeded" Message?</a></li>
<li><a href=#outputformat>Changing Output Format</a></li>
<li><a href=#profiles>Selective Scooping Using Profiles</a></li>
<li><a href=#whatsgoingwrong>What's Going Wrong?</a></li>
<li><a href=#gettingtheoutput>Getting The Output</a></li>
<li><a href=#htmloutputindex>The HTML Output Index</a></li>
<li><a href=#outputrenaming>Output Renaming</a></li>
</ul>

<a name=thecommand>
<hr><H2>Running the Command Itself</H2>

The easiest way to get started with sitescooper is to simply run it:
<UL><B>perl sitescooper.pl</B></UL>
(UNIX users can leave out the <B>perl</B> command at the start of the line,
but will have to provide the correct path to the <B>sitescooper.pl</B>
script. Linux users who installed from the RPM get sitescooper in their path,
so they can just type <B>sitescooper</B> to run it.)

<P>The first time you run sitescooper, it will pop up a list of sites in
your editor and ask you to pick which sites you wish to scoop. Your choices
are saved to a file in your temporary directory called <B>site_choices.txt</B>.
Your temporary directory is the <B>.sitescooper</B> subdirectory
of your home directory on UNIX, or <B>C:\Windows\TEMP</B> for Windows users;
this can be changed by editing the built-in configuration in the script.
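
<P>For example, on UNIX (assuming the default temporary directory mentioned
above), you should be able to review or change your selections later by
editing that file by hand and re-running sitescooper:
<UL><B>vi ~/.sitescooper/site_choices.txt</B></UL>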

<P>Once you've chosen some sites, it'll run through them, retrieve the
pages, and convert them into <B>iSilo</B> format. See <a href=#outputformat><B>Changing
Output Format</B></a> below if you wish to change this.
<p>
<a name=sitesdir>
<hr><H2>The Sites Directory</H2>
Versions of sitescooper before 2.0 used a different mechanism to choose
sites; instead of picking them from a list, you had to copy them manually
from the <B>site_samples</B> directory into a <B>sites</B> directory, and
sitescooper would use the site files in that directory. This is still supported
in 2.0; if there are any site files in the <B>sites</B> directory, they'll
be read and those sites will be downloaded when you run sitescooper. If
you're a pre-2.0 user and don't want to keep doing things that way, just
delete those files.
<p>
<a name=sitesoncmdline>
<hr><H2>Selecting Sites On The Command Line</H2>
If you want to scoop from one specific site, you can use the <B>-site</B>
argument to do this. Provide the path to the site file and sitescooper
will only read that one site.
<UL><B>perl sitescooper.pl -site site_samples/web/alertbox.site</B></UL>
Multiple sites can be chosen by providing multiple <B>-site</B> arguments,
one before each site name, or by using the <b>-sites</b> switch:
<UL><B>perl sitescooper.pl -sites site_samples/web/*.site</B></UL>
<p>
<a name=scoopaurl>
<hr><H2>Scooping a URL Without a Site File</H2>

Let's say you want to scoop a URL which doesn't have a site file written
for it. Run sitescooper with the URL on the command line and it will scoop
that URL, tidy up the HTML as best it can without knowledge of the site's
layout, and convert it into the chosen format.
<UL><B>perl sitescooper.pl http://jmason.org/</B></UL>
You can even provide a limited form of multi-level scooping with this,
using the <B>-levels</B> and <B>-storyurl</B> arguments:
<UL><B>perl sitescooper.pl -levels 2 -storyurl 'http://jmason.org/.*\.html'
http://jmason.org/</B></UL>
Personally, I think this is only handy when prototyping a site file, but
it's possible nonetheless.
<p>
<a name=nottooclever>
<hr><H2>Stopping Sitescooper From Being Too Clever</H2>

Sitescooper includes logic to avoid re-scooping pages you've already read.
Sometimes, however, you will want it to scoop them again anyway; if this is
the case, use the <B>-refresh</B> argument. This will cause sitescooper to
ignore any historical accesses of the site and scoop the lot, although it will
still use any cached pages it has already downloaded.
<UL><B>perl sitescooper.pl -site site_samples/web/alertbox.site</B>
<BR><B>perl sitescooper.pl -refresh -site site_samples/web/alertbox.site</B></UL>
This is <I>very</I> handy when you're writing a site file.

<p>
<a name=nottooclever2>
<hr><H2>Stopping Sitescooper From Being Too Clever, pt. 2</H2>

<B>-refresh</B> uses the cached pages where possible. This is not always what
you want, as a page may have changed since it was last downloaded even though
the cached copy has not yet expired.  To avoid this, use the <B>-fullrefresh</B>
argument. This will cause sitescooper to ignore any historical accesses of the
site and scoop the lot, ignoring any cached pages and reloading every page
from the source (unless the <B>-fromcache</B> argument is used).

<UL><B>perl sitescooper.pl -site site_samples/web/alertbox.site</B>
<BR><B>perl sitescooper.pl -fullrefresh -site site_samples/web/alertbox.site</B></UL>
<p>
<a name=filesizelimit>
<hr><H2>What's This "File Size Limit Exceeded" Message?</H2>
Sitescooper imposes a limit of 300 kilobytes on the HTML or text scooped
from any one site; otherwise it's quite easy to write site files which
can generate an 800K PRC file in one sitting!

<P>By the way, note that the resulting PRC files may be well under 300Kb in
size; sitescooper imposes the limit on the raw HTML or text as it goes along,
and it's entirely plausible that the conversion tools used might do a great job
of compressing the data.

<P>Also note that when you hit the limit on a site, the missed stories will
often simply be scooped the next time you run the script, although this
depends on the site file.

<P>If you want to increase the limit, use the <B>-limit</B> argument:
<UL><B>perl sitescooper.pl -limit 500</B></UL>
will scoop your chosen sites, with a limit of 500Kb.
<p>
<a name=outputformat>
<hr><H2>Changing Output Format</H2>
Currently, these are your options. The command-line switch is shown
in bold text before each description.
<UL>
<LI><P>
<B>-text</B>:
plain text, with all the articles listed one after the other</P></LI>

<LI><P>
<B>-doc</B>:
DOC format, for reading with AportisDoc or another DOC reader on a Palm
handheld. This is essentially plain-text format converted using MakeDoc.</P></LI>

<LI><P>
<B>-html</B>:
HTML format, with all the indexes and articles listed one after the other
in one file, with hyperlinks between them</P></LI>

<LI><P>
<B>-mhtml</B>:
M-HTML (multiple-page HTML) format, which is the same as HTML except it
separates the stories and indexes into separate files with hyperlinks between
them</P></LI>

<LI><P>
<B>-isilo (the default format)</B>:
iSilo format, for reading with iSilo on a Palm handheld. This is HTML converted
using the iSilo conversion tool</P></LI>

<LI><P>
<B>-misilo</B>:
M-iSilo format, iSilo format files with multiple pages. This is an M-HTML
article converted using the iSilo tool</P></LI>

<LI><P>
<B>-plucker</B>:
Plucker format, for reading with Plucker on a Palm handheld. This is HTML
converted using the plucker-build conversion tool</P></LI>

<LI><P>
<B>-mplucker</B>:
Multi-page Plucker format, for reading with Plucker on a Palm handheld. Again,
the plucker-build conversion tool is used. This format is recommended over
the -plucker format, as Plucker has a page-length limitation, which results
in pages being split in an ugly fashion.</P></LI>

<LI><P>
<B>-richreader</B>:
RichReader format, for reading with RichReader on a Palm. This is HTML
converted using the RichReader conversion tool</P></LI>

</UL> It's also possible to run any command you like to convert the resulting
output; see the <a href=sitescooper.html#item__pipe>documentation for the
<B>-pipe</B> switch</a> if you're interested in this.
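
<P>For example, to scoop your chosen sites into multi-page Plucker format
rather than the default iSilo format, just add the switch (combine it with
<B>-site</B> as usual if you only want one site):
<UL><B>perl sitescooper.pl -mplucker</B></UL>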

<P>If you want to convert to multiple output formats, you need to run sitescooper
once for each output format, and use a shared cache between the separate
invocations. Ask on the mailing list for more information on this.
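
<P>For example, assuming you have a shared cache set up and both runs point
at it, one approach is to scoop normally in one format and then re-run from
the cache in another, using the <B>-fromcache</B> switch described below:
<UL><B>perl sitescooper.pl -isilo</B>
<BR><B>perl sitescooper.pl -fromcache -plucker</B></UL>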
<p>
<a name=profiles>
<hr><H2>Selective Scooping Using Profiles</H2>

Story profiles are a way of scooping sites for a particular set of words or patterns.
If the words in question don't appear in a story, that story will not be
scooped.  (This functionality was contributed by James Brown &lt;jbrown /at/
burgoyne.com&gt; - thanks James!)
<p>

Here's a sample profile file, as an example:
<pre>
	Name: Bay Area Earthquakes

	Score: 10

	Required: san jose, earthquake.*, (CA|California)

	Desired: Richter scale, magnitude, damage, destruction, fire,
		flood, emergency services, shaking, shock wave

	Exclude: soccer
</pre>

And here's James' description of the format:

<BLOCKQUOTE>
A profile contains the following:
<ul>
<li><p>
        Name - appears with the output story to identify which profile was matched
(required)
</p></li>
<li><p>
        Score - indicates how well the profile has to match (higher numbers filter
out more stories) (optional: default 30)
</p></li>
<li><p>
        Required - words that are required to be in the story or it doesn't match
(semi-optional)
</p></li>
<li><p>
        Desired - words that might be in the story for it to match (semi-optional)
</p></li>
<li><p>
        Excluded - words that cannot appear in the story, otherwise it doesn't
match (optional)
</p></li>
</ul>
Obviously, at least one of 'Required' or 'Desired' must be present or it won't
match anything.
<p>
The score is
a minimum value that must be matched (basically a percentage of keyword hits
vs. number of sentences).  The required keywords must be present or the
story does not match.  The desired keywords give hints about what is
interesting.  The more desired keywords that match, the higher the story
scores.  The exclude keywords will cause a story not to match if they are
present.
<P>
All of the keywords (required, desired, exclude) can be phrases, and all are
processed as Perl regular expressions, so they can be quite complex if
needed.  Keywords are separated by either a comma or a newline.  Scouts.nhp
is probably the richest example of what can be done with a profile (it
includes regular expressions).
<P>
I added an "IgnoreProfiles" command to the site file definition to allow
users to scoop the entire site rather than just the stories that match.
<P>
</BLOCKQUOTE>

To turn on Profile mode, use the <B>-grep</B> argument when running sitescooper;
any sites that do not contain <B>IgnoreProfiles: 1</B> will then be filtered
using the active profiles.
<p>

To use a profile, create a directory called <b>profiles</b>, and set the
<b>ProfilesDir</b> parameter in the <b>sitescooper.cf</b> file to point to that
directory.  Now copy in the profiles you are interested in from the
<b>profile_samples</b> directory of the distribution.  UNIX users who aren't
sure where sitescooper has been installed should look in /usr/share/sitescooper
or /usr/local/share/sitescooper.  Edit the profiles to taste, and run

<UL><B>perl sitescooper.pl -grep</B></UL>

You can also use the <b>-profile</b> or <b>-profiles</b> switches to specify
individual profiles you wish to use, without requiring a <b>ProfilesDir</b>
directory to be set up. These switches have the same semantics as the
<b>-site</b> and <b>-sites</b> switches.
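
<P>For example, to search your chosen sites against a single sample profile
without setting up a <b>ProfilesDir</b> (the path below assumes you're running
from the distribution directory; substitute whichever profile you want):
<UL><B>perl sitescooper.pl -grep -profile profile_samples/Scouts.nhp</B></UL>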
<p>

<p>
<a name=whatsgoingwrong>
<hr><H2>What's Going Wrong?</H2>
If sitescooper is acting up and not doing what it's supposed to do, try
the <B>-debug</B> switch. This will turn on lots of extra debugging
output to help you work out what's wrong.
<UL><B>perl sitescooper.pl -debug</B></UL>
This is <I>very very</I> handy when writing a site file.

<P>There's also a <B>-nowrite</B> argument which will stop sitescooper
from writing cache files, <B>already_seen</B> entries, and output files.
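
<P>For example, combining the two switches gives a dry run of a single site,
with debugging output but nothing written to disk:
<UL><B>perl sitescooper.pl -debug -nowrite -site site_samples/web/alertbox.site</B></UL>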

<P>If the worst comes to the worst, you can get sitescooper to copy the
HTML of every page accessed to a journal file using the <B>-admin journal</B>
switch. This HTML is logged: first, in its initial form straight from the
website; secondly, after the page has been stripped down using the site's
StoryStart and StoryEnd; and finally, as text. This is handy for debugging a
site file, but is definitely not recommended during normal use, as a big site
will produce a <I>lot</I> of journal output.
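
<P>For example, to journal a single site while debugging its site file
(expect a lot of output):
<UL><B>perl sitescooper.pl -admin journal -site site_samples/web/alertbox.site</B></UL>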

<P>If you have all the files in your cache, use the <B>-fromcache</B> switch
and network accesses will be avoided entirely. This is handy for debugging
your site offline, or for producing output in multiple formats from the
same files, if you have a shared cache set up.
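
<P>For example, to re-run a single site entirely from the cache while working
on its site file offline:
<UL><B>perl sitescooper.pl -fromcache -site site_samples/web/alertbox.site</B></UL>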
<p>
<a name=gettingtheoutput>
<hr><H2>Getting The Output</H2>
Where possible, the output from sitescooper is written to the installation
directory of your Pilot Desktop software.

<UL>
<LI><P>
On UNIX, it'll be copied into the <B>~/pilot/install</B> directory if it
exists, or <B>~/pilot </B>otherwise. Users of gnome-pilot, PilotManager, or
JPilot should enter <b>gnome-pilot</b>, <b>PilotManager</b>, or <b>JPilot</b>
in the <b>PilotInstallApp</b> field of the configuration, as sitescooper
includes built-in support for those tools.  (If you use KPilot, mail me and
tell me where it should go!)</P></LI>

<LI><P>
Windows users have it easy, as sitescooper will automatically install
PRC files into the Pilot Desktop <b>Install</b> directory.  If you use
multiple Palm devices, however, you will still need to edit the configuration
to name the correct directory.
</P></LI>

<LI><P>
Mac users get the worst of all worlds, as the output is simply left in
the <B>txt</B> sub-folder of the temporary folder. They need to copy it
over themselves manually (sorry).</P></LI>
</UL>

If you want the output directly, use the <B>-dump</B> switch. This will
cause readable formats (text, HTML) to be dumped directly to standard output,
i.e. to the screen, or to a file if you've redirected stdout.

<P><B>-dumpprc</B> does the same thing for the binary formats, such as DOC,
iSilo, M-iSilo or RichReader. Note that multi-file formats such as M-HTML don't
get dumped either way; the <I>path</I> to the file which contains the first
page of the output is printed instead.
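
<P>For example, to dump a single site as plain text and capture it in a file
using ordinary shell redirection:
<UL><B>perl sitescooper.pl -dump -text -site site_samples/web/alertbox.site &gt; alertbox.txt</B></UL>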

<P>Some versions of Perl on Windows have difficulty redirecting stdout, so
the <B>-stdout-to</B> argument allows the same thing to be done from within
the script itself.
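
<P>For example (this sketch assumes <B>-stdout-to</B> takes the output
filename as its argument; check the command-line reference if in doubt):
<UL><B>perl sitescooper.pl -dump -text -stdout-to alertbox.txt</B></UL>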
<p>
<a name=htmloutputindex>
<hr><H2>The HTML Output Index</H2>

Scoops generated using <b>-html</b> or <b>-mhtml</b> get a bonus feature: an
index page will be generated in the <B>txt</B> sub-folder of the temporary
folder, listing all currently-existing HTML output.

<p>
<a name=outputrenaming>
<hr><H2>Output Renaming</H2>
By default, the output files are generated with the current date in the
filename and in the title used in the PRC file (when a PRC file is generated).
Use the <B>-prctitle</B> and <B>-filename</B> arguments to change this
behaviour. More information on this can be found in the command-line reference
documentation.<p>
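
<P>For example (the exact argument format is described in the command-line
reference; this sketch assumes each switch takes a replacement string):
<UL><B>perl sitescooper.pl -prctitle "Morning News" -filename morning_news</B></UL>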

<!-- start of nav links --><hr>
<p align=right>
<nobr> [
<a href=index.html>README</a> ]
<br>
[
<a href=installation.html>Installing</a> ]|[
<a href=unix_install.html>on UNIX</a> ]|[
<a href=windows_install.html>on Windows</a> ]|[
<a href=mac_install.html>on a Mac</a> ]
<br>
[
<a href=running.html>Running</a> ]|[
<a href=sitescooper.html>Command-line Arguments Reference</a> ]
<br>
[
<a href=writing_site.html>Writing a Site File</a> ]|[
<a href=site_params.html>Site File Parameters Reference</a> ]
<br>
[
<a href=rss-to-site.html>The rss-to-site Conversion Tool</a> ]|[
<a href=subs-to-site.html>The subs-to-site Conversion Tool</a> ]
<br>
[
<a href=contributing.html>Contributing</a> ]|[
<a href=gpl.html>GPL</a> ]|[
<a href=http://sitescooper.org/>Home Page</a> ]
</nobr>
</p>
<!-- end of nav links --> </body></html>