Saturday, June 3, 2017

The Arachne CMS seals database

I learned from the Aegeanet bulletin board that the Arachne CMS databases are online.  This includes a very large engraved seals database.

I have very little interest in engraved seals, but I went to the site anyway and looked around.  I learned that the Heidelberg University seals database numbers about 13,000 entries.  They do not allow access to their database except through a browser.  To do serious research work you’d have to look at every one of the 13,000 entries and painstakingly recreate your own seals DB from theirs, one entry at a time.  Let’s say you can do two per hour.  That amounts to 6,500 man-hours.  There are only about 2,250 man-hours in a work year.  That means that anyone who wants to use the Arachne seals DB for serious research has to spend about 2.9 man-years, or 34.67 months, in analysis and DB building.  Along with all the errors that hand copying entails.
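The back-of-envelope arithmetic checks out; here it is as a few lines of Python, using exactly the figures above:

```python
# Sanity-check of the man-hours estimate above.
entries = 13_000        # Heidelberg seals database entries
per_hour = 2            # entries hand-copied per hour
hours = entries / per_hour          # man-hours of copying
work_year = 2_250       # man-hours in a work year (the figure used above)
years = hours / work_year
months = years * 12
print(hours, round(years, 1), round(months, 2))   # 6500.0 2.9 34.67
```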

Which is an unnecessary time expenditure, because anyone who wants the Arachne seals database can have the entire thing with no more time investment than the couple of days it takes to write a piece of software.

How is that possible?

Well, first of all, it’s the internet, and the internet is notoriously open.  It is open especially for benign purposes, which this is.  But, second, the Arachne people have been sloppy with their website: they show all the parameters of a page call right on the URL line.  The result is that the URL can be programmatically manipulated to let you download one page after another until you’ve downloaded every one of their pages.  The whole download process (images and all) would take, at most, a few minutes.  Once downloaded, each page can be parsed for its constituent info (let me emphasize that the information being dumped is exactly the information that the web site is designed to provide in the first place, and analyzing this page is exactly what your favorite web browser already does).  After the download, a series of SQL statements can be generated which recreate a customized form of their database on the local machine.  All of this can be done with a single program, so that the sequence is ‘web pages in; SQL statements out’.

It would require that the would-be downloader write a program that is, in essence, a tiny customized browser designed just to read Arachne web pages.  That browser can then parse the string coming from Arachne and write out SQL statements that encapsulate the info of interest.  Once the SQL statements are executed, the database is magically recreated on the local machine.
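That ‘tiny customized browser’ is only a few lines in any language.  Here is a sketch in Python rather than PHP, purely for illustration; the base URL and parameter name are stand-ins, not Arachne’s real ones, which you would read off the URL line:

```python
# Sketch of the 'tiny customized browser': fetch each record's page string,
# hand it to a parser, and move on to the next record.
from urllib.request import urlopen

# Hypothetical URL pattern -- the real one comes off the Arachne URL line.
BASE = "https://arachne.example/seal?record={}"

def fetch_page(n):
    """Return the page string for record n, exactly as a browser would receive it."""
    with urlopen(BASE.format(n)) as resp:
        return resp.read().decode("utf-8", errors="replace")

def harvest(first, last, parse):
    """Walk the records from first to last, yielding one parsed result per page."""
    for n in range(first, last + 1):
        yield parse(fetch_page(n))
```

The whole pipeline is then one call: `harvest(first, last, my_parser)`, with the parser doing the field extraction described below.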

But let’s get back to that part about the sloppiness.  What does that look like?

A typical Arachne seals page

In this picture we see a typical Arachne seals page.  The browser I’m using, just for the record, is Chrome (Version 58.0.3029.110 (64-bit)).  See that red box towards the bottom?  That’s the information that you want to copy into your own database.  And the maxim is that if something is visible on your computer then it’s ON your computer and, if you can get access to it, you can do anything you want with it.  And, if you look at that little red circle at the top, you’ll see the number 495 there, followed by an ampersand (&).  If you changed that number to 496 and reloaded the page, then Arachne would serve the next image to your machine.  If you put that in a loop in a program, then Arachne would serve ALL the images to your machine.  And I call this sloppiness because those parameters don’t have to be visible.  I think that if Arachne really wanted to force you to use an approved browser then they should have used the POST method instead of GET for transmitting parameters.  That would at least keep the passed parameters off the URL line.
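Bumping that number by program is one line of string surgery on the URL.  A sketch in Python (the URL below is invented; only the idea of incrementing the query parameter matters):

```python
# Do by program what the red circle invites you to do by hand: take the URL,
# bump the record number, and ask the server for the next page.
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_record(url, param="id"):
    """Return the same URL with the numeric query parameter bumped by one."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[param] = [str(int(query[param][0]) + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Call it in a loop and the whole database walks past your program, one page at a time.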

What about the information in the red box?  How do you get hold of that?  Let’s look at a piece of the page string that is served by the Arachne servers.  This is the exact same string served to your regular browser; to see it, just use your browser to go to one of the Arachne web pages, right-click, and select 'View Source'.

...border="0" galleryimg="no"/></a><div style="font-size: 14px; padding: 3px; font-weight:bold;">Informationen</div><div class="info_wrap"><h3 class="info">Aufbewahrung: </h3><div class="info">Paläpaphos, Zypern, römisches Kammergrab.</div><div class="info">- Ortsangabe ist Fundort -</div><div class="info">Nikosia, Zypern, Cyprus Museum.</div><div class="info">- Ortsangabe ist Aufbewahrungsort -</div></div><div class="info_wrap"><h3 class="info">Herkunft: </h3><div class="info">Fundland: Zypern</div><div class="info">Fundort: Paläpaphos, CY</div><div class="info">Flur/Grabungsstätte: römisches Kammergrab</div><div class="info">Fundstelle: nein</div><div class="info">Platztypus: Grab</div><div class="info">Kontextdatierung: römisch</div><div class="info">Bereich: Zypern</div></div><div class="info_wrap"><h3 class="info">Form: </h3><div class="info">Siegelform: Lentoid</div><div class="info">Zahl der Siegelflächen: 1</div><div class="info">Wölbung der Siegelfläche: konvex</div><div class="info">Umriss der Siegelfläche: rund</div><div class="info">Dekor der Nebenseite: nein</div><div class="info">Art der Durchbohrung: diametral</div></div><div class="info_wrap"><h3 class="info">Material und Technik: </h3><div class="info">Material: Chalcedon?</div><div class="info">Bearbeitungstechnik: Rad, Rundperl, Tubus</div><div class="info">Sonstige Merkmale: nein</div></div><div class="....

Portion of the page string for an Arachne Seals web page

This string is the portion that creates the info box which I outlined in red above.  I changed some of the fields to red (I think that there are about 40 fields of interest) to show where they are.  The first ‘Fundland’ shows that this seal was found in Cyprus.  The second ‘Fundort’ gives you the find city, namely Palaipaphos on Cyprus.  And so forth.  (And somewhere in here is the link to the high-resolution image of the seal.  Arachne snottily tells you that if you’re just a casual user then you can only get the low-res version of the image - or so they think.)  I emphasize that this page string is exactly the string that your Chrome or Firefox browser is actually sent by Arachne and from which your browser creates the web page that you see.  We just need a more efficient way to extract the data in a usable form and then create SQL statements out of it.  That’s what’s nice about this.  It is NOT a hack.  It does NOT try to change their data.  It does NOT attempt to modify their code to do something they don’t want it to do.  It’s merely a different way of consuming the product that they already WANT to deliver.
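Extracting those fields is plain pattern matching on the `<div class="info">` markers.  A sketch in Python, using a regular expression instead of the explode() approach described below (same idea, different tool):

```python
import re

# Each field of interest sits in its own <div class="info">Key: Value</div>
# (see the page string above).  One pattern pulls them all into a dict.
INFO = re.compile(r'<div class="info">([^<:]+):\s*([^<]*)</div>')

def parse_info(page):
    """Map field names like 'Fundland' or 'Fundort' to their values."""
    return {key.strip(): value.strip() for key, value in INFO.findall(page)}
```

Run on the excerpt above, `parse_info(...)["Fundland"]` comes back as `'Zypern'`, `["Siegelform"]` as `'Lentoid'`, and so on through the whole list.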

Much of web programming is done in a language called PHP.  PHP supports a call (leaving cURL completely out of the picture, because we don’t need anything so sophisticated) called file_get_contents().  The parameter to this call is the URL whose page you want to acquire.  When you execute this call, the server on the Arachne machine returns the string representation of the page.  Your own favorite browser does exactly this, just with much more powerful utilities.  When you get the page string back you can do what I was suggesting: parse the page and write out the information in the form of SQL statements.  Writing out the SQL statements is trivial.  The whole program looks like this:

// establish the first URL; the record number is the part that changes
$record = 495;                  // first record number, as seen on the URL line
$url = $base_url . $record;     // $base_url is the fixed part of the Arachne URL

$web_page  = file_get_contents($url);
$web_array = explode('><div class="info">', $web_page);

foreach ($web_array as $element) {
    // if the element starts with a key word like 'Fundort' or 'Fundstelle',
    // then extract the argument and store it in a unique variable, 1..n
}

// generate the SQL statement, e.g.:
// "insert into seals values ('unique var 1', 'unique var 2', ... 'unique var n');"

// update the URL to point to the next page and repeat
$record++;


It’s a little more sophisticated than that.  The dates have to be extracted and placed in their own table, since there can be a variable number of dates for each seal.  Basically, though, that’s all there is to it.  Now you may not want to do this, but don’t forget that this is always possible on the internet, and for many more scholarly databases than just Arachne.  These internet tools are far more powerful than the uninitiated suppose.  What I suggest to Arachne is that, if they really want to restrict access to approved browsers only, they protect their URL line better.
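The dates wrinkle is the standard one-to-many split: one row per seal, a child table holding however many dates that seal carries.  A sketch using Python’s built-in sqlite3; the table and column names are mine, not Arachne’s, and the second date value is invented for the example:

```python
import sqlite3

# One row per seal, plus a child table for the variable number of dates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table seals (id integer primary key, fundland text, fundort text);
    create table seal_dates (seal_id integer references seals(id), dating text);
""")
conn.execute("insert into seals values (495, 'Zypern', 'Paläpaphos')")
conn.executemany("insert into seal_dates values (?, ?)",
                 [(495, 'römisch'), (495, 'hellenistisch')])

# All dates for seal 495 come back with a simple join-style lookup.
dates = [d for (d,) in conn.execute(
    "select dating from seal_dates where seal_id = 495")]
```

The generated insert statements from the parsing step feed straight into these two tables.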

And don't write back to me and say that this can't be done.  I got my version of this program running along with the associated database in about two days.  

Just for the exercise.