
Examples and Definition of the <gutenberg /> Tag

Tag: <gutenberg />

"Gutenberg" seems like a good enough name for the tag. It is unlikely to be used in any other XML markup, and since, at least at the beginning, it will apply almost exclusively to Project Gutenberg etexts, it seems like a reasonable name.

For the time being, the tag is required to occupy only one line. Nowadays, most text editors have line wrapping, so this should not be a problem. If you use vi or pico, well, use something else for this tag. Also, this tag should occupy the first line of the file. This is not a hard rule, but currently, the LIBREria script will probably be affected by improper placement of the tag (although I have not tested this).
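The one-line, first-line placement described above can be checked with a short test before any parsing is attempted. This is only a minimal sketch: `find_gutenberg_tag` is a hypothetical helper name, not a subroutine in the LIBREria script, and it assumes no attribute value contains a ">" character.

```perl
use strict;
use warnings;

# Hypothetical helper: return the <gutenberg /> tag if it opens and
# closes on the given line, or undef otherwise.
sub find_gutenberg_tag {
    my ($first_line) = @_;
    return unless defined $first_line;

    # Assumption: attribute values never contain ">".
    if ( $first_line =~ m{(<gutenberg\b[^>]*/>)}i ) {
        return $1;
    }
    return;
}

my $line = '<gutenberg title="Emma" author="Jane Austen" />';
my $tag  = find_gutenberg_tag($line);
print defined $tag ? "found: $tag\n" : "no tag on first line\n";
```

A script would pass only the first line of the etext to this helper, which is why a tag wrapped across several lines would go unnoticed.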

I used all uppercase letters for the names of the attributes in this page. However, in the tag, they can be upper or lower case or any combination of the two. Personally, I prefer lower case, but people creating the tags should choose whichever is most convenient. For attribute arguments, only arguments that need to be case sensitive for some reason or another will be. For instance, the AUTHOR attribute must be case sensitive because the name must be a non-arbitrary combination of upper and lower case letters. On the other hand, the arguments for the toggles can be upper or lower case or even numerical (1 or 0). Case sensitive attributes are marked below.
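As an illustration of the toggle rule, the sketch below folds the accepted spellings into a single on/off state. `normalize_toggle` is a hypothetical name, and accepting "yes" and "no" alongside on/off and 1/0 is an assumption based on what the parsing code further down this page matches.

```perl
use strict;
use warnings;

# Hypothetical helper: map a toggle argument to 1 (on) or 0 (off).
# Returns undef for anything unrecognised so the caller can fall
# back to the attribute's default.
sub normalize_toggle {
    my ($value) = @_;
    return unless defined $value;

    my $v = lc $value;    # toggle arguments are not case sensitive
    return 1 if $v eq 'on'  || $v eq 'yes' || $v eq '1';
    return 0 if $v eq 'off' || $v eq 'no'  || $v eq '0';
    return;
}

for my $arg (qw(ON off 1 0 maybe)) {
    my $state = normalize_toggle($arg);
    printf "%-5s => %s\n", $arg, defined $state ? $state : 'default';
}
```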

Two attributes are required, and all the other attributes are optional. Optional attributes all have defaults, so it will only be necessary to set them if the default value is undesirable.
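One way to picture the required/optional split is a hash of defaults that the tag's own attributes overwrite. The sketch below is not how the LIBREria script does it (that code handles each attribute separately); `tag_attributes` and `%defaults` are hypothetical names, and only three of the defaults listed on this page are shown.

```perl
use strict;
use warnings;

# A few of the defaults listed on this page.
my %defaults = (
    chapter     => 'Chapter',
    themedir    => 'themes',
    numeraltype => 'roman',
);

# Hypothetical helper: collect name="value" pairs into a hash,
# lowercasing attribute names but leaving the values untouched.
sub tag_attributes {
    my ($tag) = @_;
    my %attr = %defaults;

    while ( $tag =~ /(\w+)\s*=\s*"([^"]*)"/g ) {
        $attr{ lc $1 } = $2;
    }

    # TITLE and AUTHOR are the two required attributes.
    for my $required (qw(title author)) {
        warn "missing required attribute: $required\n"
            unless defined $attr{$required} && length $attr{$required};
    }
    return %attr;
}

my %book = tag_attributes(
    '<gutenberg TITLE="The Wonderful Wizard of Oz" author="L. Frank Baum" chapter="null" />'
);
print "$_ = $book{$_}\n" for sort keys %book;
```

Any optional attribute missing from the tag simply keeps its default, so only the two required attributes need a warning.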

Because new code to parse this tag has been added to the LIBREria script, this page is provided as a starting point for discussing the attributes such a tag needs. The parsing code should be considered alpha and will not work in many situations without tweaking. This code, together with other recent additions, has made the script considerably less stable. Therefore, the latest version of the LIBREria script that parses this tag will not be posted until it is satisfactorily usable. If you would like a copy of the code as is, contact me by email and I will send you one. The code for the actual subroutine will be posted soon (within the current week).

Required Fields:
  • TITLE
  • This attribute takes, obviously, the title of the book as its argument. The string can consist of letters, spaces, and punctuation, if necessary. The attribute is required because the directory the book is put into will be named after it, and every chapter of the book will have the title at its head. CASE SENSITIVE.

  • AUTHOR
  • This attribute takes, again obviously, the name of the author. This is required as well because every chapter of the book will have the author's name at its head. CASE SENSITIVE.
Optional Fields:
  • CHAPTER
  • Default: "Chapter". This attribute is for variations in chapter naming schemes. Possible values are as follows:
    • Chapter (default)
    • CHAPTER
    • null
    • Stave
    • Capítulo
    This attribute can contain literally any one word string. Currently, I do not see any reason to make it support whitespace, so I will leave that for the future if the need arises and can be conclusively justified. CASE SENSITIVE.

  • MACRODIVISION
  • Default: null. This attribute is similar to the CHAPTER attribute, but it usually will contain entries like "Book" or "Part". Its default is null because books that contain larger divisions are in the minority. Enable it when the book has superchapter sections. CASE SENSITIVE.

  • THEMEDIR
  • Default: themes. This can be used to specify an alternate directory for css themes and images. This directory will be copied into the new book after parsing is finished. CASE SENSITIVE.

  • IMAGESDIR
  • Default: null. This is a directory for book specific images. If the book has corresponding images, the directory for those images can be specified with this attribute. Like THEMEDIR, this directory and all of its contents will be copied to the directory created for the new book and renamed "BOOK_TITLE/images". CASE SENSITIVE.

  • TABLESDIR
  • Default: null. If the book contains tables or hand formatted parts, you can set this attribute to point to the directory where the preformatted parts are stored. As above, the entire directory will be copied and renamed to "BOOK_TITLE/tables". CASE SENSITIVE.

  • AUTHORLINK
  • Default: null. If the author's name has a specific link, it can be included here.

  • NUMERALTYPE
  • Default: roman. If you need to specify the type of numerals used, use this attribute. Possible values are:
    • roman (I,II,III...) (default)
    • arabic (1,2,3...)
    • spelled (ONE,TWO,THREE...)
    • chinese (一,二,三...)
    • named (for individually named chapters)
    This attribute will only have to be set in very special cases, such as when the chapters are numbered but not named (as in 1984) or arbitrarily named (as in The Jungle Book).

  • MACRONUMERALTYPE
  • Default: roman. This is the same as NUMERALTYPE but for macrodivisions (Book, Part, etc.). Again, the program should determine this information automagically, but if it does not, it can be set with this attribute. The possible values are the same as above.

  • ATCONTENTS
  • This attribute is an array which can contain all of the chapter names if there is no consistent naming system. For instance, in the etext of The Jungle Book, the chapters are not named "Chapter 1, Chapter 2, etc." Instead, each chapter has a name as follows: "Mowgli's Brothers, Hunting-Song of the Seeonee Pack, etc." They are all arbitrary from the standpoint of a search algorithm. Therefore, this attribute can take a series of arguments in this fashion:
    ATCONTENTS="Mowgli&apos;s Brothers, Hunting-Song of the Seeonee Pack, Kaa&apos;s Hunting, Road-Song of the Bandar-Log, &quot;Tiger! Tiger!&quot;, Mowgli&apos;s Song, The White Seal, Lukannon, &quot;Rikki-Tikki-Tavi&quot;, Darzee&apos;s Chant, Toomai of the Elephants, Shiv and the Grasshopper, Her Majesty&apos;s Servants, Parade Song of the Camp Animals"
    Entities (specifically, &quot; and &apos;) must be used for quotes and apostrophes because of the nature of XML and HTML. If they are not used, parsers will not know where the attribute begins and ends. For now, a comma followed by a space separates the fields. If there is a comma in a chapter title, use the entity "&#44;". CASE SENSITIVE.
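The escaping rules for ATCONTENTS can be sketched as a pair of helpers that encode a list of chapter titles and decode it again. Both subroutine names are hypothetical; the decoding half mirrors the split on ", " and the three entity substitutions used by the parsing code on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of chapter titles into an
# ATCONTENTS value, escaping the characters the tag reserves.
sub encode_atcontents {
    my @titles = @_;    # copy, so the caller's strings are untouched
    for my $t (@titles) {
        $t =~ s/"/&quot;/g;
        $t =~ s/'/&apos;/g;
        $t =~ s/,/&#44;/g;
    }
    return join ', ', @titles;
}

# Hypothetical helper: split an ATCONTENTS value back into titles.
sub decode_atcontents {
    my ($value) = @_;
    my @titles = split /, /, $value;
    for my $t (@titles) {
        $t =~ s/&apos;/'/g;
        $t =~ s/&quot;/"/g;
        $t =~ s/&#44;/,/g;
    }
    return @titles;
}

my @chapters = ( q{Mowgli's Brothers}, q{"Tiger! Tiger!"}, q{Kaa's Hunting} );
my $value    = encode_atcontents(@chapters);
print qq{atcontents="$value"\n};
print "$_\n" for decode_atcontents($value);
```

Because commas inside a title are replaced before the titles are joined, the ", " separator can never appear inside a single encoded title.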

Toggles (on/off):
  • LEGAL
  • Default: on. This toggle decides whether to separate out legal information distributed with the book. Project Gutenberg etexts are fairly well guaranteed to have this information included. Use this toggle to switch off the separation when using a file with no legal information included.

  • BLOCKQUOTING
  • Default: off. This code tries to recognize quoted sections by clues in the format. The code for this should be considered pre-alpha, and this toggle should be switched on only by the bravest of souls.

  • SUBHEADERTOGGLE
  • Default: on. This needs to be on for books with alternate names for chapters in situations such as:
    CHAPTER I

    Down the Rabbit-Hole
    where both lines have to be emphasised as the chapter header. It should be switched off when there is only the name of the chapter and the next paragraph is the text of the book.

  • MACROSUBHEADERTOGGLE
  • Default: off. This is nearly the same as the SUBHEADERTOGGLE attribute, but it applies to macrodivisions (Book, Part, etc.).

  • EMPHASIS
  • Default: off. This attempts to use HTML style emphasis for single words which are underscored (_word_) or all uppercase (WORD). This should not be turned on in books with lots of acronyms or books with all uppercase chapter names (e.g., CHAPTER).
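To show what the EMPHASIS toggle is aiming at, here is a minimal sketch of the two substitutions. It is an assumption about the approach, not the code LIBREria uses: `add_emphasis` is a hypothetical name, the choice of the <em> element is a guess at "HTML style emphasis", and an all-uppercase word is taken to mean two or more capital letters so that "I" and "A" are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap ALL-CAPS words and _underscored_ words
# in <em> tags.
sub add_emphasis {
    my ($text) = @_;

    # All-uppercase words of two or more letters. This runs first so
    # that the tags added below are never matched again.
    $text =~ s{\b([A-Z]{2,})\b}{<em>$1</em>}g;

    # Single words wrapped in underscores, e.g. _word_.
    $text =~ s{(?<!\w)_([^\W_]+)_(?!\w)}{<em>$1</em>}g;

    return $text;
}

print add_emphasis('It was _very_ LOUD, I said.'), "\n";
print add_emphasis('CHAPTER I'), "\n";
```

The second call shows why the toggle should stay off for books with uppercase chapter names: the heading word itself gets wrapped in emphasis.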

Tag Examples:

  • mohic10.txt: <gutenberg title="The Last of the Mohicans" author="James Fenimore Cooper" chapter="CHAPTER" />

  • jnglb10.txt: <gutenberg title="The Jungle Book" author="Rudyard Kipling" atcontents="Mowgli&apos;s Brothers, Hunting-Song of the Seeonee Pack, Kaa&apos;s Hunting, Road-Song of the Bandar-Log, &quot;Tiger! Tiger!&quot;, Mowgli&apos;s Song, The White Seal, Lukannon, &quot;Rikki-Tikki-Tavi&quot;, Darzee&apos;s Chant, Toomai of the Elephants, Shiv and the Grasshopper, Her Majesty&apos;s Servants, Parade Song of the Camp Animals" /> (For this tag to work, the table of contents in the text file must be removed.)

  • wizoz10.txt: <gutenberg title="The Wonderful Wizard of Oz" author="L. Frank Baum" chapter="null" numeraltype="arabic" />

  • lrngr10.txt: <gutenberg title="The Lone Star Ranger" author="Zane Grey" chapter="CHAPTER" macrodivision="BOOK" subheadertoggle="0" />



Code to Parse the <GUTENBERG /> Tag (Beta)



########################################
#  This parses the <gutenberg /> meta tag.  
#  This code is intended as a replacement for
#  hand keying of file attributes.  Every attribute
#  setting except for the filename should be able
#  to be included.
########################################

sub parsegutenberg
{
  #  The tag arrives as the first (and only) argument.
  my ( $tag ) = @_;

  print "    ###########################\n";
  print "    ##  Gutenberg Tag Found! ##\n";
  print "    ###########################\n\n";

  print "$tag\n";
  
  #  required tags
  
  if ( $tag =~ /title=\"(\S+)\"/i
  || $tag =~ /title=\"(\w(\w|\s|[-.:;'])+)\"/i )
#  ( $tag =~ /title=\"(.+)\"/i )
  {
    $title = $1; 
  
    print "title == $title\n\n";
  }
  
  else
  {
    warn "libreria.pl: This book has no title!!\n\n";
  }
  
  #  Require at least one character so that author="" falls through
  #  to the warning below.
  if ( $tag =~ /author=\"([^\"\s]+)\"/i
  || $tag =~ /author=\"(\w(\w|\s|\.|\[|\])+)\"/i )
#  ( $tag =~ /author=\"(.+)\"/i )
  {
    $author = $1;
    
    print "author == $author\n\n"
  }
  
  else
  {
    warn "libreria.pl: This book has no author!!\n\n";
    $author = "Anonymous";
  }
  
  # Optional tags
  
  if ( $tag =~ /chapter=\"null\"/i )
  {
    $chapter = "!null!";
    print "\$chapter == $chapter\n";
  }
  
  elsif ( $tag =~ /chapter=\"(中文:(.+)_(.{0,3}))\"/ )
  {
    $chapter = $1;
    $zhchapterbegin = $2;
    $zhchapterend = $3;
    print "\$zhchapterbegin == $zhchapterbegin\n\n\$zhchapterend == $zhchapterend\n\n";
  }
  
  elsif ( $tag =~ /chapter=\"([a-zA-Z](\w|\s)+)\"/i )
  {
    $chapter = $1;
    
    print "chapter == $chapter\n\n"
  }
  
  else
  {
    $chapter = "Chapter";
  }
  
  #  \b keeps this from also matching inside macronumeraltype="..."
  if ( $tag =~ /\bnumeraltype=\"([a-z]+)\"/i )
  {
    $numeraltype = $1;
    
    print "numeraltype == $numeraltype\n\n"
  }
    
  #  The "null" check comes first; otherwise the general pattern
  #  would capture "null" as an ordinary division name.
  if ( $tag =~ /macrodivision=\"null\"/i )
  {
    $macrodivision = "!null!";
  }
  
  elsif ( $tag =~ /macrodivision=\"(\w(\w|\s)+)\"/i )
  {
    $macrodivision = $1;
    
    print "macrodivision == $macrodivision\n\n"
  }
  
  elsif ( $tag =~ /macrodivision=\"(中文:(.+)_(.{0,3}))\"/ )
  {
    $macrodivision = $1;
    $zhmacrodivisionbegin = $2;
    $zhmacrodivisionend = $3;
    print "\$zhmacrodivisionbegin == $zhmacrodivisionbegin\n\n\$zhmacrodivisionend == $zhmacrodivisionend\n\n";
  }
  
  else
  {
    $macrodivision = "";
  }
  
  if ( $tag =~ /macronumeraltype=\"([a-z]+)\"/i )
  {
    $macronumeraltype = $1;
    
    print "macronumeraltype == $macronumeraltype\n\n"
  }
  
  else
  {
    $macronumeraltype = "null";
  }
    
  if ( $tag =~ /themedir=\"(\w(\w|\s)+)\"/i )
  {
    $themedir = $1;
    
    print "themedir == $themedir\n\n"
  }
  
  else
  {
    $themedir = "themes";
  }
  
  if ( $tag =~ /imgdir=\"(\w(\w|\s)+)\"/i )
  {
    $imgdir = $1;
    
    print "imgdir == $imgdir\n\n"
  }
  
  else
  {
    $imgdir = "images";
  }
  
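  #  Stop at the closing quote so that attributes placed after
  #  atcontents are not swallowed into the chapter list.
  if ( $tag =~ /atcontents=\"([^\"]+)\"/i )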
  {
    my $atcontents = $1;
#    print "\$atcontents == $atcontents\n";
    @contents = split ( ", ", $atcontents );
#    print "\@contents == @contents\n";
    $anumeral = 1;

    foreach my $attribute ( @contents )
    {
      #  Turn the entities back into the characters they stand for.
      $attribute =~ s/&apos;/'/g;
      $attribute =~ s/&quot;/"/g;
      $attribute =~ s/&#44;/,/g;
      print "\@contents: $attribute\n";
      
      $filenumber = filenumber ( $anumeral );
      $filename = ( "Chapter" . $filenumber . ".txt" );
      
      $contents_hash { $attribute } = $filename;
      $anumeral++;
    }
        
   @magcontents = @contents;
   $numeraltype = "named";
   $chapter = "!null!";

  }

  
  #  Toggles
  
  if ( $tag =~ /legal=\"(0|no|off)\"/i )
  {
    $legal = 0;
    
    print "legal == $legal\n\n"
  }
  
  else
  {
    $legal = 1;
  }
  
  #  \b keeps this from also matching inside macrosubheadertoggle="..."
  if ( $tag =~ /\bsubheadertoggle=\"(0|no|off)\"/i )
  {
    $subheadertoggle = "off";
    
    print "subheadertoggle == $subheadertoggle\n\n"
  }
  
  elsif ( $tag =~ /\bsubheadertoggle=\"quote\"/i )
  {
    $subheadertoggle = "quote";
    
    print "subheadertoggle == $subheadertoggle\n\n"
  }
  
  else
  {
    $subheadertoggle = "on";
  }

  #  Accept "off" as well, to match the other toggles.
  if ( $tag =~ /macrosubheadertoggle=\"(0|no|off)\"/i )
  {
    $macrosubheadertoggle = 0;
    
    print "macrosubheadertoggle == $macrosubheadertoggle\n\n"
  }
  
  else
  {
    $macrosubheadertoggle = 1;
  }

  if ( $tag =~ /compoundfile=\"(1|yes)\"/i )
  {
    $compoundfile = "yes";
    print "compoundfile == $compoundfile\n\n"

  }
  
  if ( $tag =~ /emphasis=\"(1|yes|on)\"/i )
  {
    $emphasis = "yes";
    
    print "emphasis == $emphasis\n\n"
  }
  
  else
  {
    $emphasis = "no";
  }
  
  if ( $tag =~ /paragraphization=\"(other|play|line)\"/i )
  {
    $paragraphization = $1;
    print "\$paragraphization == $paragraphization\n\n";
  }
  
  else
  {
    $paragraphization = "gutenberg";
  }
  
}








This site was generated with a derivative of the LIBREria script.
It was last updated on 2007:07:18.

Creative Commons License
This website and all works within it (except for the scripts themselves) are licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.