Monday, July 25, 2022

Scholarly Citation of Digital Resources; Proofing your Site against Link Rot

 In my last posts, here and here, I started to deal with the idea of academic citation of digital resources.  I explained that Elliott's insistence that Pleiades is 'citation-ready' rests on nothing more than a server rewrite rule.  Such a rewrite rule does nothing for digital citation, for the following primary reason:


A digital resource (online) is liable to change or disappear.  When it does, the result is known as 'link rot'.  Wikipedia is probably our civilization's greatest example of exactly that.

If links do change or disappear, then a user who tries a link YOU supplied will get a 404 (page-not-found) error.  That makes you look bad and your site seem a lot less reliable.  I gave an example of how link rot has negatively impacted Peripleo, which is Pleiades' flagship (if that's the word I want) product.

Is scholarly citation of digital resources even possible?  I admit that it's easier to cite a physical product, such as a book or an article, because, with few exceptions, it isn't going to disappear if you stop paying the fees.

In this blog post I'm not going to deal with the citation of digital resources directly.  I'm going to deal with checking whether the links embedded in your database are still good.  You may not be able to do anything about link rot; that's in someone else's hands.  You can, however, regularly check that your embedded links are still good.  If they're not, then you can do something about it.  The real problem with link rot is that it's an invisible process (invisible to you, that is).  I'm going to present a program that makes that process visible.

// Program checkLink
<?php

function checklink($l2ch)        // This routine uses cURL to get the response headers back
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $l2ch);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data    = curl_exec($ch);
    $headers = curl_getinfo($ch);    // status information, including the HTTP response code
    curl_close($ch);
    return $headers['http_code'];
}

function connectToDB()           // You have to write your own DB connect routine
{}

$link = connectToDB();           // connect to the Database

// The query retrieves the title of the work and the URL.  It weeds out JSTOR links because I
// already know that these are 'stable'.
$query  = "select Src, Tl, URL from biblio where URL is not null and URL not like '%jstor%';";
$result = mysqli_query($link, $query);
$lcount = 0;

while ($row = $result->fetch_assoc())
{
    $lcount++;
    $URL = $row['URL'];    // Get the stored URL for this resource
    $Tl  = $row['Tl'];     // Get the Title
    $src = $row['Src'];    // Get my own personal DB code for this title

    $check_url_status = checklink($URL);    // Call the checklink routine

    echo "$lcount: Title: $Tl for URL: $URL\n URL status is: $check_url_status\n";    // make the URL and Title visible ...

    // examine the result:
    switch ($check_url_status)
    {
        case 200 : {echo "Success";                   break;}
        case 201 : {echo "Created";                   break;}
        case 202 : {echo "Accepted but not complete"; break;}
        case 203 : {echo "Partial information";       break;}
        case 204 : {echo "No response";               break;}
        case 301 : {echo "Moved and assigned a new URL.";  break;}
        case 302 : {echo "Resides under a different URL; however, the redirection may be altered on occasion";  break;}
        case 400 : {echo "Some kind of error";        break;}
        case 404 : {echo "Not found";                 break;}
        default  : {break;}
    }

    echo "\n\n";    // skip a couple of lines
}    // end of while loop

?>
Here's a trace of what it looks like when executing:

λ php chlink.php
1: Title: Höhenheiligtümer und Schreine in Palästen und Siedlungen der Altpalastzeit Kretas. Ein Vergleich des rituellen Inventars for URL: https://core.ac.uk/download/pdf/18263645.pdf
URL status is: 200
Success

2: Title: for URL: http://www.archaeology.wiki/blog/2015/04/08/archaeology-tzoumerka-part-1/
URL status is: 301
Moved and assigned a new URL.

3: Title: 1. Geschichte der wissenschaftlichen Erforschung von Paros. for URL: https://www.google.com/books/edition/_/9iAKAAAAIAAJ?hl=en&gbpv=1&pg=PA366&dq=Avyssos,+Paros
URL status is: 302
Resides under a different URL, however, the redirection may be altered on occasion

4: Title: for URL: https://www.dainst.org/documents/10180/16114/00+JB+2010/93bf4ab7-e4c4-4614-9b1a-56c0d32ce8f8
URL status is: 404
Not found

5: Title: Archeologie au Levant for URL: https://www.persee.fr/issue/mom_0244-5689_1982_ant_12_1?sectionId=mom_0244-5689_1982_ant_12_1_1199
URL status is: 200
Success

6: Title: The Warrior Grave at Plassi, Marathon for URL: https://www.archaeology.wiki/blog/2017/04/06/the-warrior-grave-at-plassi-marathon/
URL status is: 200
Success

7: Title: Vlochos: Ruins of a city scattered atop a hill for URL: https://www.archaeology.wiki/blog/2018/09/14/vlochos-ruins-of-a-city-scattered-atop-a-hill/
URL status is: 200
Success

...

This program works as follows.  It first connects to the database.  It then forms a query that retrieves the title (Tl) and the URL from the biblio table.  It skips over JSTOR URLs because those are assumed to be stable; JSTOR provides non-changing stable links, which is what I put in my database.  The program then falls into a while loop which processes every URL I have in my database (some 1268 by current count), one at a time.  It sends each URL to a routine called checklink(), which uses cURL services to get the headers back in an accessible form.  The status goes into the $check_url_status variable.

Then $check_url_status is run through a PHP switch statement.  If the value is 200 then the request succeeded (the URL is good) and no further action is needed.  The other 2xx values are rare.  The 3xx pair (301, 302) usually means that the URL has been changed on the server side but that a rule was left behind about how to find the new resource.  These usually do not cause problems.  The 400 series means that the link is broken: either some other action needs to be taken on your part to find the desired resource, or the URL has to be deleted from the database.
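If fetching every document body proves slow over a thousand-plus links, cURL can be asked for the headers alone.  The following variant of checklink() is a sketch of that idea (the function name and the timeout value are my additions, not part of the program above); note that a few servers mishandle HEAD requests, so spot-check any surprising results against the GET version.

```php
<?php
// Sketch of a headers-only link check.  CURLOPT_NOBODY makes cURL
// issue a HEAD request, so the document body is never downloaded.
function checklinkHead($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);           // HEAD request: headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);         // don't hang forever on a dead host
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // 0 if the host could not be reached
    curl_close($ch);
    return $status;
}
```

A status of 0 (as opposed to a 4xx code) means the host could not be reached at all, which is worth distinguishing from a page that is merely missing.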

Item no. 4 in the trace is an example:

"4: Title: for URL: https://www.dainst.org/documents/10180/16114/00+JB+2010/93bf4ab7-e4c4-4614-9b1a-56c0d32ce8f8
URL status is: 404
Not found"

This was an attempt to find a resource hosted on the website of the German Archaeological Institute (DAI).  Whatever it was, it has now been moved and is unretrievable at the present URL.  I'm either going to have to find that resource some other way or I'm going to have to delete this URL from my DB.

The other URL checks return a status of 200, which means that they succeeded and no action is needed.  You should probably change the line:

case 200 : {echo "Success"; break;}

to

case 200 : {break;} // or even remove this altogether

That way you'll only ever see the questionable URLs, which is probably why you're doing this in the first place.

You could modify this routine so that this line:

case 400 : {echo "Some kind of error"; break;}

is rewritten as:

case 400 : {echo "update biblio set URL = null where src = '$src' limit 1;\n";
break;}

This change will cause a series of these SQL UPDATE statements to be generated, and they can easily be assembled into a script.  (You would want to make the same change for the other 4xx cases, such as 404, since those are the links that are actually broken.)
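The assembly itself can be done from the command line.  A sketch, assuming the checker is saved as chlink.php and that each bad link now prints a line beginning with 'update biblio':

```shell
# Collect the generated UPDATE statements into a script for review.
# (grep exits non-zero when nothing matched, hence the '|| true'.)
php chlink.php | grep '^update biblio' > fix_urls.sql || true

# Inspect fix_urls.sql by hand, then apply it to the database, e.g.:
#   mysql -u youruser -p yourdb < fix_urls.sql
```

Reviewing the script before running it is the point: a 4xx status sometimes reflects a transient server problem rather than a genuinely dead link.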

If this utility is used judiciously it should help you to dramatically reduce link rot in your product and make it a more robust scholarly resource.

Friday, July 22, 2022

 

Where broken links come from: the case of Peripleo


In my last post I suggested that rewrite rules were intended to make it easier for a user to search for an item on a web site.  The general concept looks like this:


Now, given this set-up, instead of this (hypothetical) complex link:

{a} https://www.pleiades.org/stoa/locator-page-constructor?place=579885

the user would just type this on the URL line:

{b}  pleiades.stoa.org/places/579885

Of course that's not really true, is it?  Just one difficulty: how would the user know the specific key (579885) for Athens?

Rewrite rules are really intended for the convenience of the server-side developer.  In this case I'm talking about Pleiades but, in fact, there are probably millions of web sites (no exaggeration) which make use of rewrite rules for their own convenience.  For example, what  if I were to take my path:

{c}  https://www.helladic.info/MAPC/pkey_report_wparam.php?place=C4880

and change the program name to, oh, I don't know, let's say this:

{d}  https://www.helladic.info/MAPC/DEV/Places/key_param.php?place=C4880

In this case my original rewrite rule would still work.  The outer appearance of my site would not have changed.  If I made more radical changes it would still be O.K. as long as I came up with a new rewrite rule.  This new rule would preserve my site's surface appearance to the outside world so that people invoking my site would not have to change their software or database or calling routines, or all three.  So the purpose of a rewrite rule is to decouple my site's internal workings from the appearance which it presents to the outside world.
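As a sketch of what that might look like under Apache's mod_rewrite (the relocated path in {d} is my hypothetical, and so is this rule), the old public path can be mapped internally onto the relocated script:

```apache
RewriteEngine On
# Requests for the old public path are served by the relocated script;
# no [R] flag, so the rewrite is internal and invisible to the caller.
RewriteRule  ^MAPC/pkey_report_wparam\.php$  /MAPC/DEV/Places/key_param.php  [L,QSA]
```

The [QSA] flag ensures the original query string (?place=C4880) is carried through to the new script.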

A failure to keep internal reality and outside appearance in sync is illustrated by the failure of all the links to the British Museum on Pleiades' Peripleo project.  When I search for 'kylix', for example, I get this display:


The link to the British Museum (red arrow) does not resolve.  When I try it Chrome gives me this message:

"This site can’t be reached"

Now, to be fair, the problem appears to be on the British Museum's side.  They have changed something in the path to their pictures but have not created a rewrite rule to preserve the previous appearance, nor have they let Pleiades know about it.  Because this has never worked in the eight months that I've been trying it, I find it curious, although not surprising, that Pleiades/Peripleo doesn't care enough to try to fix it.

After all, in the article I quoted from last time, Tom Elliott has this to say about URL/URIs:

"They’re cool because, if you construct them sensibly and connect them to interesting information and take care of them so they don’t rot into uselessness, they make citation happen." [1]

(emphasis mine)

Next time I want to address the idea of what it means to use digital resources for citation.

Notes
  1. Elliott [2018] 45.
Biblio
Elliott [2018]:  Elliott, Tom, 'The Pleiadic Gaze: Looking at Archaeology from the Perspective of a Digital Gazetteer',  Archaeology and Economy in the Ancient World; Classical Archaeology in the Digital Age – The AIAC Presidential Panel 12.1 (51), 43-51.  Kristian Göransson ed., 2018.  Online here.

Wednesday, July 20, 2022

The concept of citation-readiness and Mycenaean Atlas Project

 A URL is a 'Uniform Resource Locator': it refers to a web site, e.g. pleiades.stoa.org.  A URI is a 'Uniform Resource Identifier': it refers to a specific item sheltered under the URL.  So, in the Pleiades scheme, 579885 is the identifier that refers to Athens.  And the link (URL + identifier), pleiades.stoa.org/places/579885, is the pointer to 'Athens' in their database, and it is what Pleiades will resolve.  This is a very friendly form and easy for the user to remember.

The actual address of the Pleiades Athens page would not be friendly at all.  It's more likely to look something like this:

{a} https://www.pleiades.org/stoa/locator-page-constructor?place=579885

My example here is made up, but Pleiades' genuine locator is going to be very much like it, and an awful lot for the user to remember.  Much easier to allow the user to write:

{b}  pleiades.stoa.org/places/579885

Now, to be clear, form {b} is for the user's convenience.  The server running at Pleiades cannot use this string directly.  It has to convert this string into form {a} before the server can invoke the right resource.  A rewrite rule is what allows the server to do this.  The result is convenience for the user but correctness for the server.  Win-win.

Well, what one group can do another can also.  I created some rewrite rules for helladic.info so that, if you know the location key for a BA site, you can just use that without remembering the long pathname to the actual program.  So from now on you will be able to type and/or embed a simplified form, e.g. here for the stadium at Argos:

{b}  www.helladic.info/C4880

instead of the real full path:

{a}  https://www.helladic.info/MAPC/pkey_report_wparam.php?place=C4880

The rewrite rule itself is just this:

RewriteRule  ^(C[0-9]+)$  /MAPC/pkey_report_wparam.php?place=$1  [L,R]
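For readers unfamiliar with mod_rewrite syntax, here is the same rule annotated (flag meanings per the Apache mod_rewrite documentation):

```apache
#  ^(C[0-9]+)$   match a request path that is exactly 'C' followed by
#                one or more digits, and capture it as $1
#  [L]           last rule: stop processing further rewrite rules
#  [R]           redirect: send the browser to the rewritten URL
RewriteRule  ^(C[0-9]+)$  /MAPC/pkey_report_wparam.php?place=$1  [L,R]
```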

... and the result looks like this:




In one of his foggier statements about Pleiades URLs (he clearly doesn't understand the technology), Tom Elliott has this to say about enhancing citation practice.

“On the world-wide-web, the identifiers necessary for citation should be front-and- center: they are the strings of characters that you put into the location bar of your browser in order to retrieve a web page.  They are the essential magic in a hyperlink.  Their technical name is “Uniform Resource Identifier,” a phrase usually abbreviated with  the acronym URI.  URIs (or yoo-ahr-ees, as they’re sometimes pronounced) are cool.  They’re cool because, if you construct them sensibly and connect them to interesting information and take care of them so they don’t rot into uselessness, they make citation happen.  In throwing off the normalizing tyranny of a single map view to embrace the radical equality of all places, Pleiades was born citation-ready.  Because Sean Gillies and others present at the creation payed attention to emerging best practice and cared about scholarly communication, Pleiades was born citation-ready.”[1]

(I removed Elliott's footnotes 17 and 18).

Elliott suggests that this practice makes Pleiades 'citation-ready'.  'Citation-ready' in this sense requires, however, no special 'magic' (the word is Elliott's, and it's one which he should avoid) because every common web server supports rewrite rules that will convert a simple address typed into the URL box (here pleiades.stoa.org/places/579885) into the more complex 'actual' address where that resource resides.  As usual Elliott is erecting a cathedral on a postage stamp - overselling a normal web function.  If all it takes is a rewrite rule to make Pleiades 'citation-ready' then helladic.info is now also 'citation-ready' ...

or something.

But is a simple rewrite-rule enough to create citation-readiness?  I'm not convinced.  We all could benefit from a discussion which tries to clear up the idea of scholarly citation in the context of on-line resources.  Maybe next time.

Notes
  1. Elliott [2018] 45-6.
  2. A guide to rewrite rules is here and a handy cheat sheet for regular expressions (REs) is here.

Biblio

Elliott [2018]:  Elliott, Tom, 'The Pleiadic Gaze: Looking at Archaeology from the Perspective of a Digital Gazetteer',  Archaeology and Economy in the Ancient World; Classical Archaeology in the Digital Age – The AIAC Presidential Panel 12.1 (51), 43-51.  Kristian Göransson ed., 2018.  Online here.

It seems to me that this paper was actually written at least a decade ago but I can find no earlier reference to it.  Is it just me?  Does anyone else know better?

Sunday, July 10, 2022

Search in the Mycenaean Atlas Project

 The ability to search is a large part of the Mycenaean Atlas.  Our search has changed a great deal over the years and so it's time for a new and detailed description.


All the pages in the M.A.P. have a search box.  It looks like this:


In the M.A.P. search will do three things:

I. Locate a point on the map
II. Return a list of sites which match a search string
III. Return a map which shows sites that match a search string

I. Locate a point on a map.
If you enter a lat/lon pair separated by a comma then Search returns a map with your input location marked by a cross. 
[Map with cross, 37.2, 22.2]




You'll see that this map display also shows a thumbnail map which is zoomed out enough to keep you oriented about where in the Mediterranean you are.  Your current location is indicated by the tiny red diamond icon at the center.

Your search lat/lon pair can also have degree symbols; those will be ignored.  Something like this:


If there are M.A.P. sites or features in the vicinity then those will be shown as well.  Here there happens to be a cult-site icon at the center of the map.  These icons are also interactive.  You can mouse over them or click on them.  If you click on one you'll see an info-box which contains a link to the full description of the site or feature.

One limitation is that only M.A.P. sites and/or features are shown.  Sites in the other supported DBs like  Pleiades or Topostext are not returned in this simple lat/lon search.

This lat/lon search is highly customizable by supplying additional arguments.  These (in order!) are as follows:

  • Lat
  • Lon
  • Zoom level, from 2 to 18 (the default is 15)
  • Frame size, in decimal degrees from 0.01 to 0.30 (the default is 0.15)

For scale: a frame size of 0.01 is 1.11 km wide, the default of 0.15 is 16.7 km wide, and 0.30 is 33.4 km wide.

You must always supply a lat/lon pair but you can also specify the zoom level, the frame size, both, or neither.

  • Just zoom: nn.nn , nn.nn , z[z]
  • Just frame size: nn.nn, nn.nn , , 0.f[f] ----> notice that the comma for the zoom argument is still required.
  • Both: nn.nn , nn.nn , z[z] , 0.f[f]
  • Neither: nn.nn , nn.nn

nn.nn indicates latitude or longitude values; latitude comes first.

z[z] indicates a 1 or 2 digit zoom integer value from 2 through 18

0.f[f] indicates a frame size from 0.01 to 0.3. This includes such values as 0.08, 0.14, 0.25, etc.

Do not enter html tags or special characters.
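To make the argument rules concrete, here is a rough sketch (my own illustration, not the site's actual code) of how such a search string might be parsed and validated in PHP:

```php
<?php
// Parse "lat, lon [, zoom] [, frame size]" per the rules above.
// Returns an associative array on success, or null for bad input.
function parseLatLonSearch($s)
{
    $s     = str_replace('°', '', $s);      // degree symbols are ignored
    $parts = array_map('trim', explode(',', $s));

    if (count($parts) < 2 || count($parts) > 4) return null;
    if (!is_numeric($parts[0]) || !is_numeric($parts[1])) return null;

    // An empty third field ("nn.nn, nn.nn, , 0.08") falls back to the default zoom.
    $zoom  = (isset($parts[2]) && $parts[2] !== '') ? (int)$parts[2]   : 15;
    $frame = (isset($parts[3]) && $parts[3] !== '') ? (float)$parts[3] : 0.15;

    if ($zoom  < 2    || $zoom  > 18)   return null;
    if ($frame < 0.01 || $frame > 0.30) return null;

    return ['lat'   => (float)$parts[0], 'lon'   => (float)$parts[1],
            'zoom'  => $zoom,            'frame' => $frame];
}
```

For example, parseLatLonSearch('37.2, 22.2, , 0.08') would yield the default zoom of 15 with a frame size of 0.08.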


II. Return a list of sites which match a search string.

If your search string is NOT in the format lat, lon then the software conducts a traditional search.  Here's a search for the string 'Elias'.


If the List radio button is checked then your results will be returned as a list of sites and clickable place keys.

 Here is the first page of results returned when we search for 'Elias'.


These returned sites are those which, somewhere in their names or detailed descriptions, contain your search string (SS).  The returned list is presented in sections; the SS may be matched in any of the following:

  • M.A.P. site name
  • Find type comment
  • Site bibliography section title
  • Period section comment
  • Site general note
  • Feature name
  • Author name in a bibliography entry
  • Title or publisher name of a bibliography entry
  • Pleiades ID name
  • Pleiades descriptive label
  • Topostext site name
  • deGrauuw's Harbor name
  • TrisMegistos name
  • Vici.Org name
  • D.A.R.E. name

Each returned list entry gives a link to a detail page for that site, the site name, the string containing the SS (in which the SS appears in upper case, like this: '[SS]'), and a link to the Digital Atlas centered on the specific matched site.

III. Return a map which shows sites that match a search string

In the Greek-speaking world there are many places with names based on Hagios Elias.  When I search for 'Elias' by list I get 88 results.  But which is the one I really want?  Perhaps I'm really interested in 'Mar Elias', which is a monastery somewhere in Israel.  If we could see all returned results on a map we could quickly sort out the right one.  We check the 'Map' radio button and do the search.  The results look like this:



On the map we see that the only 'Elias' in Israel (in all the databases) is the monastery 'Mar Elias', which is in Jerusalem.  It happens that the Digital Atlas of the Roman Empire (DARE) has a detailed record about it, which you can access from the 'D' icon.  In the next picture you see the drop-down menu of the several databases.  I have turned them all off except for D.A.R.E.






Locating a Late Minoan Settlement near Prina on Crete (C7884)

In Hayden [2005] there is a description of a Late Minoan settlement in the Vrokastro area of Crete.  The site sits just below the western bo...