I learned from the Aegeanet bulletin board that
the Arachne CMS databases are online.
This includes a very large engraved seals database.
I have very little interest in engraved seals but
I did go to this site anyway and looked at it.
I learned that the Heidelberg University Seals Database numbers about
13000 entries. They do not allow access
to their database except through a browser.
To do serious research work you’d have to look at every one of the 13,000
entries and you’d have to painstakingly recreate your seals DB from theirs one
entry at a time. Let’s say you can do
two per hour. That amounts to 6500 man
hours. There are only 2250 man hours in
a work year. That means that anyone who
wants to use the Arachne seals DB for serious research has to spend 2.9 man
years or 34.67 months in analysis and db building. Along with all the errors that hand copying entails.
Which is an unnecessary time expenditure because anyone
who wants the Arachne seals database can have the entire thing with no more
time investment than a couple of days to write a piece of software.
How is that possible?
Well, first of all, it’s the internet and the
internet is notoriously open. It is open
especially for benign purposes which this is.
But, second, the Arachne people have a sloppiness in their
website which is that they show all the parameters to a page call right on
their url line. The result is that that
url can be programmatically manipulated to allow you to download one page after
another until you’ve downloaded every one of their pages. The whole download process (images and all)
would take, at most, a few minutes. Once
downloaded each page can be parsed for its constituent info (let me emphasize
that the information being dumped is exactly the information that the web site
is designed to provide in the first place and analyzing this page is exactly
what your favorite web browser already does).
After download a series of sql statements can be generated which can create a customized form of their database on the local machine. All of this can be done with a single program
so that the sequence is ‘web pages in; SQL statements out’.
It would require that the would-be downloader
write a program that is, in essence, a tiny customized browser that is designed
just to read Arachne web pages. That web
browser can then parse the string coming from Arachne and write out
sql statements that encapsulate the info of interest.
Once the sql statements are executed the database will be magically recreated
on the local machine.
But let’s get back to that part about the sloppiness. What does that look like?
A typical Arachne seals page |
In this picture we see a typical Arachne seals
page. The browser I’m using, just for the record, is
Chrome (Version 58.0.3029.110 (64-bit)). See that red box towards the bottom? That’s the information that you want to copy
and put in your own database. And the
maxim is that if something is visible on your computer then it’s ON your
computer and, if you can get access to it, you can do anything you want with
it. And, if you look at that little red
circle at the top, you’ll see the number 495 there followed by an ampersand
(&). If you could change that number
to 496 and reload the page then Arachne would serve the next image to your
machine. If you put that in a loop in a
program then Arachne would serve ALL the images to your machine. And I call this a sloppiness because those parameters don’t have to be
visible. I think that if Arachne really
wanted to force you to use an approved browser then they should have used the
POST method instead of GET for transmitting parameters. That would make the passed parameters invisible.
What about the information in the red box? How do you get hold of that? Let’s look at a piece of the page string that is served
by the Arachne servers. This is the exact same string served to your regular browser and to see this, just use your browser to go to one of the Arachne web pages and then press the right mouse button on your mouse. Then select 'View Source'.
...border="0" galleryimg="no"/></a><div style="font-size: 14px; padding: 3px; font-weight:bold;">Informationen</div><div class="info_wrap"><h3 class="info">Aufbewahrung: </h3><div class="info">Paläpaphos, Zypern, römisches Kammergrab.</div><div class="info">- Ortsangabe ist Fundort -</div><div class="info">Nikosia, Zypern, Cyprus Museum.</div><div class="info">- Ortsangabe ist Aufbewahrungsort -</div></div><div class="info_wrap"><h3 class="info">Herkunft: </h3><div class="info">Fundland: Zypern</div><div class="info">Fundort: Paläpaphos, CY</div><div class="info">Flur/Grabungsstätte: römisches Kammergrab</div><div class="info">Fundstelle: nein</div><div class="info">Platztypus: Grab</div><div class="info">Kontextdatierung: römisch</div><div class="info">Bereich: Zypern</div></div><div class="info_wrap"><h3 class="info">Form: </h3><div class="info">Siegelform: Lentoid</div><div class="info">Zahl der Siegelflächen: 1</div><div class="info">Wölbung der Siegelfläche: konvex</div><div class="info">Umriss der Siegelfläche: rund</div><div class="info">Dekor der Nebenseite: nein</div><div class="info">Art der Durchbohrung: diametral</div></div><div class="info_wrap"><h3 class="info">Material und Technik: </h3><div class="info">Material: Chalcedon?</div><div class="info">Bearbeitungstechnik: Rad, Rundperl, Tubus</div><div class="info">Sonstige Merkmale: nein</div></div><div class="....
Portion of the page string for an Arachne Seals web page
This string is the portion that creates the info
box which I outlined in red above. I
changed some of the fields to red (I think that there are about 40 fields of interest)
to show where they are. The first ‘Fundland’
shows that this seal was found in Cyprus.
The second ‘Fundort’ gives you the location of the find city, namely
Palaipaphos on Cyprus. And so
forth. (And somewhere in here is the link to the high-resolution image of the seal. Arachne snottily tells you that if you're just a casual user then you can only get the low-res version of the image - or so they think.) I emphasize that this page string is exactly
the string that your Chrome or Firefox browser actually is sent by Arachne and
from which your browser creates the web page that you see. We just need a more efficient way to extract
the data in a usable form and then create SQL statements out of it. That's what's nice about this. It is NOT a hack. It does NOT try to change their data. It does NOT attempt to modify their code to do something they don't want it to do. It's merely a different way of consuming the product that they already WANT to deliver.
Much of web programming is done in a language
called PHP. PHP supports a call (leaving
cURL completely out of the picture because we don’t need anything so
sophisticated) called file_get_contents().
The parameter to this call is the URL whose page you are
interested in acquiring. When you execute this call the server on the Arachne machine returns the string representation of the page. Your own
favorite browser does exactly this except with much more powerful
utilities. When you get the page string back
you can then do what I was suggesting which was to parse the page and write out the information in the form of SQL statements.
Writing out the SQL statements is trivial. The whole program looks like this:
establish first URL
MAIN LOOP
web_page =
file_get_contents(URL);
web_array
= explode('><div class="info">', Web_page);
FOR (every element of web_array)
{
IF start of element is a key word like ‘Fundort’
or ‘Fundstelle’ then extract the argument and store it in a unique variable, 1..n
}
Generate the SQL statement: e.g. “insert in seals values (‘Unique
var 1’, ‘Unique var 2’, … ‘Unique var n’);”;
END FOR LOOP
update the URL to point to the next page.
END MAIN LOOP;
It’s a little more sophisticated than that. The dates have to be extracted and placed in
their own table since there can be a variable number of dates for each seal. Basically, though, that’s all there is to
it. Now you may not want to do this but don't forget that this is always possible on the internet and for many more scholarly databases than just Arachne. These internet tools are far more powerful than the uninitiated suppose. What I suggest to Arachne is, if they really want to restrict access to only approved browsers, that they protect their URL line better.
And don't write back to me and say that this can't be done. I got my version of this program running along with the associated database in about
two days.
Just for the exercise.
Great!
ReplyDeleteHi Bender! Glad that you liked this.
ReplyDelete