Monday, July 25, 2022

Scholarly Citation of Digital Resources; Proofing your Site against Link Rot

 In my last posts, here and here, I started to deal with the idea of academic citation to digital resources.  I explained that Elliott's insistence that Pleiades was 'citation-ready' is nothing more than a server rewrite rule.  Such a rewrite rule does nothing for digital citation for the following primary reason:


A digital resource (online) is liable to change or disappear.  If they do this then the result is known as 'link rot'.  Wikipedia is probably our civilization's greatest example of exactly that.

If links do change or disappear then when a user tries to use a link YOU supplied he or she will get a 404 (page-missing) error.  That makes you look bad and seem a lot less reliable.  I gave an example of how link-rot has negatively impacted Peripleo which is Pleiades' flag-ship (if that's the word I want) product.

Is scholarly citation to digital resources even possible?  I admit that it's easier to cite a physical product such as a book or an article because, with few exceptions, they aren't going to disappear if you stop paying the fees.  

In this blog post I'm not going to deal with the citation of digital resources directly.  I'm going to deal with checking to see whether your links (embedded in your database) are still good.  You may not be able to do anything about link rot; that's in someone else's hands.  You can, however, regularly, check to see that your embedded links are still good.  If they're not then you can do something about it.  The real problem with link-rot is that it's an invisible process (invisible to you, that is).  I'm going to present a program that makes that process visible.

// Program chechLink
<?php

function checklink($l2ch)                    //    This routine uses cURL to get the headers back
{
    $ch                 = curl_init();
    curl_setopt($ch, CURLOPT_URL, $l2ch);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch , CURLOPT_RETURNTRANSFER, 1);
    $data             = curl_exec($ch);
    $headers     = curl_getinfo($ch);
    curl_close($ch);
    return $headers['http_code'];
}

function  connectToDB()            // You have to write your own DB connect routine
{}

$link = connect();    // connect to the Database

//  The query retrieves the title of the work and the URL  It weeds out JSTOR links because I
// already know that these are 'stable'.
$query = "select Src, Tl, URL from biblio where URL is not null and URL not like '%jstor%';";
$result         = mysqli_query($link,$query);
$lcount = 0;

while($row = $result->fetch_assoc())
{
$lcount++;
$URL                = $row['URL'];  // Get the stored URL for this resource
$Tl                = $row['Tl'];  // Get the Title
$src = $row['Src']; // get my own personal DB code for this title

$check_url_status = checklink($URL);          // Call the checklink routine

echo "$lcount: Title: $Tl for URL: $URL\n URL status is: $check_url_status\n";  // make the URL and Title visible ...

// examine the result:
switch ($check_url_status)
{
    case 200 : {echo "Success";                              break;}
    case 201 : {echo "Created";                             break;}
    case 202 : {echo "Accepted but not complete"; break;}
    case 203 : {echo "Partial information"; break;}
    case 204 : {echo "No response";                        break;}
    case 301 : {echo "Moved and assigned a new URL.";   break;}
    case 302 : {echo "Resides under a different URL, however, the redirection may be altered on occasion";
break;}
 
  case 400 : {echo "Some kind of error";             break;}
    case 404 : {echo "Not found";                         break;}
default : {break;}
}

echo "\n\n";      // skip a couple of lines

}         // end of while loop

?>
Here's a trace of what it looks like when executing:

λ Php chlink.php
1: Title: Höhenheiligtümer und Schreine in Palästen und Siedlungen der Altpalastzeit Kretas. Ein Vergleich des rituellen Inventars for URL: https://core.ac.uk/download/pdf/18263645.pdf
URL status is: 200
Success

2: Title: for URL: http://www.archaeology.wiki/blog/2015/04/08/archaeology-tzoumerka-part-1/
URL status is: 301
Moved and assigned a new URL.

3: Title: 1. Geschichte der wissenschaftlichen Erforschung von Paros. for URL: https://www.google.com/books/edition/_/9iAKAAAAIAAJ?hl=en&gbpv=1&pg=PA366&dq=Avyssos,+Paros
URL status is: 302
Resides under a different URL, however, the redirection may be altered on occasion

4: Title: for URL: https://www.dainst.org/documents/10180/16114/00+JB+2010/93bf4ab7-e4c4-4614-9b1a-56c0d32ce8f8
URL status is: 404
Not found

5: Title: Archeologie au Levant for URL: https://www.persee.fr/issue/mom_0244-5689_1982_ant_12_1?sectionId=mom_0244-5689_1982_ant_12_1_1199
URL status is: 200
Success

6: Title: The Warrior Grave at Plassi, Marathon for URL: https://www.archaeology.wiki/blog/2017/04/06/the-warrior-grave-at-plassi-marathon/
URL status is: 200
Success

7: Title: Vlochos: Ruins of a city scattered atop a hill for URL: https://www.archaeology.wiki/blog/2018/09/14/vlochos-ruins-of-a-city-scattered-atop-a-hill/
URL status is: 200
Success

...

This program works as follows. It first connects to the Database. Following that it forms a query that retrieves the title (Tl) and the URL from the Biblio table in the database. It skips over JSTOR URLs because those are assumed to be stable. JSTOR provides non-changing stable links which is what I put in my database. This program then falls into a while loop which will process every URL I have in my database (some 1268 by current count); one at a time. It sends each URL to a routine called check link() which uses cURL services to get the headers back in an accessible form. This status goes into the $check_url_status variable. Then the $check_url_status is run through a PHP switch statement. If the value is 200 then it's successful (the URL is good) and no further action is needed. The other 2xx values are rare. The 3xx pair (301,302) usually means that the URL has been changed on the server side but that a rule was left behind about how to find the new resource. These usually do not cause problems. The 400 series means that the link is broken and some other action needs to be taken on your part to find the desired resource or the URL has to be deleted from the database.

Item no. 4 in the trace is like that:

"4: Title: for URL: https://www.dainst.org/documents/10180/16114/00+JB+2010/93bf4ab7-e4c4-4614-9b1a-56c0d32ce8f8
URL status is: 404
Not found"

This was an attempt to find a resource hosted on the German Archaeological Society's website (DAI). Whatever it was it has now been moved and it's unretrievable at the present URL. I'm either going to have to find that resource some other way or I'm going to have to delete this URL from my DB.

The other URL checks return a status of 200 which means that they succeeded and no action is needed. You should probably change the lines:

case 200 : {echo "Success"; break;}

to

case 200 : {break;} // or even remove this altogether

That way you'll only ever see the questionable URLs; this is probably why you are doing this in the first place.

You could modify this routine so that this line:

case 400 : {echo "Some kind of error"; break;}

is rewritten as:

case 400 : { echo "update biblio set URL = null where src = '$src' limit 1;";
break;}

This change will cause a string of these SQL 'updates' to be generated which can easily be assembled into a script.

If this utility is used judiciously it should help you to dramatically reduce link rot in your product and make it a more robust scholarly resource.

No comments:

Post a Comment

Stous Athropolithous

  (All references to Cnnn or Fnnn can be found in the Mycenaean Atlas Project site at helladic.info) I've been working through the list ...