Monday, March 28, 2022

Do Not Use the Pleiades Data Set


The river Quartaccio is a brook (described as 'fossa') just near Marina di San Nicola on the Lazio coast of Italy. This brook diverges from the Statua (another brook) at 41.9298 N, 12.135 E and flows N towards the town of Quartaccio and Valcanneto. Along this brook at about 41.9356 N, 12.1381 E a modern road crosses it. This is approximately the position of a Roman bridge which is given no name. This bridge is marked at this place in the Barrington Atlas (Map 43, A2, no. 2).   Here's what it looks like in Google Earth:

All in all it took me about half an hour to track down the right Quartaccio (there are several) and take a stab at the position of the bridge based on the Barrington Atlas.

In Pleiades this bridge is no. 426578, and its position is given as 41.875 N, 12.125 E, which is in the Tyrrhenian Sea some 3.69 km from the nearest land and 6.83 km from the putative bridge.  See it at the bottom of the next photo:
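For anyone who wants to check a displacement like this one, here is a minimal Python sketch of a great-circle (haversine) distance between the estimated bridge position and the Pleiades position. The function name and the spherical Earth radius are my own assumptions, not anything taken from Pleiades or the Barrington Atlas.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean radius; the Earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


bridge = (41.9356, 12.1381)   # estimated bridge position quoted in the text
pleiades = (41.875, 12.125)   # position given for Pleiades 426578

error_km = haversine_km(*bridge, *pleiades)
print(f"Displacement: {error_km:.2f} km")
```

On a spherical model this comes out a little under 7 km, in line with the figure quoted above. The same function can be pointed at any other pair of coordinates in these posts.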

Now these gross errors of placement in the Pleiades data are nothing new. I looked at five examples last time and I attributed them to careless digitization and the complete lack of any quality control.  I did notice that Pleiades gives its accuracy estimate for the Quartaccio Bridge as 'Rough'. I thought, 'Aha!' If I can see what 'Rough' means to the Pleiades group, then maybe I can see how far the problem goes.  I got the 'Rough' points out of the database that I made from Pleiades' data and plotted them on Google Earth. There are about 5000 points labelled 'Rough' in the Pleiades DB. I used a select to retrieve the first 3000 of these (about 1/10 of the entire Pleiades DB). Here's the result of plotting these 3000:
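A select of this kind might look roughly like the following sketch. Everything about the schema here is hypothetical: the table name `places`, the column `location_precision`, and the sample rows are stand-ins for illustration, and SQLite is used only because it ships with Python. It is not the actual database built from the Pleiades .csv.

```python
import sqlite3

# In-memory database with a hypothetical schema (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE places ("
    "id INTEGER PRIMARY KEY, title TEXT, lat REAL, lon REAL, "
    "location_precision TEXT)"
)

# Illustrative rows only; the first uses the Pleiades id and position from the text.
sample = [
    (426578, "unnamed bridge (Quartaccio)", 41.875, 12.125, "rough"),
    (1, "hypothetical precise site", 37.1084, 25.3748, "precise"),
    (2, "hypothetical rough site", 38.25, 22.25, "rough"),
]
conn.executemany("INSERT INTO places VALUES (?, ?, ?, ?, ?)", sample)


def rough_points(connection, limit=3000):
    """Return up to `limit` rows whose precision is recorded as 'rough'."""
    cur = connection.execute(
        "SELECT id, title, lat, lon FROM places "
        "WHERE location_precision = ? ORDER BY id LIMIT ?",
        ("rough", limit),
    )
    return cur.fetchall()


rows = rough_points(conn)
for row in rows:
    print(row)
```

The rows that come back can then be written out as placemarks for Google Earth or any other viewer.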

This is shocking.  

Even if the location of these sites was not known (most of them are known as I will show) Pleiades has made no effort to place them even approximately.  These points are simply dumped off at the nearest half-degree or quarter-degree vertex.  Nothing else could make such a pattern.

And most of these points are not singles.  They are stacks of 5 to 10 distinct locations which are simply dumped on top of each other. All these sites have simply been rounded to the nearest half-degree or quarter-degree. And most of these points are easily locatable. When I was criticizing Pleiades for errors I truly believed that the number was, perhaps, about 100 points. Now I see that it is thousands of points that could be known but which are simply dumped off at the nearest vertex.
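One way to test for this pattern mechanically is to ask whether both coordinates of a point are exact multiples of some grid step. The sketch below is only an illustration: the function names and the tolerance are my own, and the default step of 0.125 degrees is an assumption chosen because the Quartaccio position quoted above (41.875 N, 12.125 E) happens to fall on multiples of an eighth of a degree.

```python
def on_grid(value, step, tol=1e-9):
    """True if `value` is (within `tol`) an integer multiple of `step`."""
    ratio = value / step
    return abs(ratio - round(ratio)) < tol


def snapped(lat, lon, step=0.125):
    """True if both coordinates sit on a grid vertex of spacing `step` degrees."""
    return on_grid(lat, step) and on_grid(lon, step)


print(snapped(41.875, 12.125))          # Pleiades position for the bridge
print(snapped(41.9356, 12.1381))        # estimated real position of the bridge
print(snapped(38.25, 22.25, step=0.25)) # the Corinthian Gulf pile, quarter-degree grid
```

A genuine site can land on a vertex by coincidence, so a check like this only flags candidates. Thousands of hits in one data set are another matter.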

Greece looks like this:

I've said that many of these points are just carelessly dumped in a pile even though the locations they denote are easily locatable.  Here's an example.  In the Corinthian Gulf Pleiades has overlain six points at 38.25 N, 22.25 E.

I've drawn red lines from the vertex where they were dumped to their real locations.  The line drawn to Phocis is drawn to the nearest point of land in Phocis.  The lines from the  rivers are drawn from the mouths of the respective rivers.  The Styx is the modern Mavronero above Solos.  The village of Solos is marked.  Every point was easily locatable and the average error is 13.9 km.  

In Pleiades-world four rivers, an entire province, and a gulf are all in exactly the same location.   All of their 'rough' points are exactly the same.
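Piles like this are easy to find once the data is in a list: group the records by their exact coordinate pair and keep the groups with more than one member. The sketch below assumes a simple list of (name, lat, lon) tuples. Apart from Phocis, the Styx, and the gulf, the names are placeholders, since the other three rivers are not named here.

```python
from collections import defaultdict

# Illustrative records; "river 2" to "river 4" are placeholder names.
records = [
    ("Phocis", 38.25, 22.25),
    ("Corinthian Gulf", 38.25, 22.25),
    ("Styx", 38.25, 22.25),
    ("river 2", 38.25, 22.25),
    ("river 3", 38.25, 22.25),
    ("river 4", 38.25, 22.25),
    ("hypothetical well-placed site", 37.1084, 25.3748),
]


def find_stacks(rows):
    """Map each (lat, lon) shared by more than one record to the names stacked there."""
    groups = defaultdict(list)
    for name, lat, lon in rows:
        groups[(lat, lon)].append(name)
    return {coord: names for coord, names in groups.items() if len(names) > 1}


stacks = find_stacks(records)
for coord, names in stacks.items():
    print(coord, len(names), names)
```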

I can think of no toponymical practice that can explain this pattern.  What's absolutely clear is that the Pleiades data set has no conceivable use for any scholarly purpose.   

The Pleiades data set is approaching twenty years of existence.  In that time the Pleiades team has had plenty of time to fix these gross errors.  I concluded long ago that Pleiades never had any interest in creating a scholarly (or even a reliable) map of the ancient world.  Soon I will dedicate one of these posts to discussing what I think their real interest is.

The Pleiades data is unsuitable as a reference.  

It is unsuitable for study of the Ancient World.  

It is unusable as a basis for developing further geographic tools.

Pleiades owes the scholarly community an explanation of why their data is so bad and what they intend to do to rectify this very poor database.

Monday, March 7, 2022

Lies, Damn Lies, and Digitization

So about the year 2000 the data which now makes up the Pleiades 'database' was created by digitization either from Barrington Atlas maps directly or from maps that were contributory to the Barrington Atlas.  This data has not been looked at, proofed, or corrected since that time.

How do I know?

Let's take a look.

On the coast of southern Laconia a small peninsula juts out into the Aegean.  For centuries it has been the home of the community of Monemvasia - a world-renowned tourist destination.  In antiquity it was called the promontory of Minoa.  Its location is 36.683 N and 23.05 E.  I show it here as it appears  in the Barrington Atlas.

Pleiades has 'Minoa Prom.' in a very different place: 36.75 N and 23.25 E.  How far is the real Minoa Pr. from the Pleiades marker?  More than eighteen kilometers and smack dab in the middle of the ocean.  Here it is in Google Earth:

The distance between the two is more than 18 km.

Now I'm not picking on Pleiades because of this one error.  There are lots of errors in my own Mycenaean Atlas.  I have made errors of commission, omission, faulty reasoning, and pure ignorance.  I know this because I sometimes catch them.  This error of Pleiades', however, is something different.  Failure to see this glaring digitizing error results from never actually working with their own data - a dataset which has been in existence for more than twenty years.  

There are more examples of failure to catch digitization errors:

Pleiades 603253 is described as "An ancient place, cited: BAtlas 61 E4 unnamed quarry (on Horomedon M. on Kos)"   Pleiades locates it in the middle of the strait separating Kos from the Turkish mainland.  This is about 8.7 km distant from its true location which is shown on the Barrington Atlas at about  36.8312 N,  27.234 E.  This is how it looks in Google Earth:

Mt.Dhikeos on Kos is the ancient Horomedon.  The radius of the circle centered on Pleiades 603253 and extended to the quarry's approximate location in the Barrington Atlas  is ~8700 m.  

Pleiades 570688 is called "Spiraion Pr." and placed at 37.75 N, 23.25 E, in the middle of the Saronic Gulf.  The true location of Spiraion (the modern Spiri) is 37.8025 N, 23.1754 E, about 8500 m. distant.  Topostext (green arrow) places it correctly.

The Mycenaean Atlas Project's new Digital Atlas of Antiquity.
Pleiades position (red arrow) is about 8500 m. from the
true position (green arrow) where the Topostext marker is.

The ancient Grotta was located where the Chora of Naxos is located now, at 37.1084 N, 25.3748 E.  Pleiades (599630) has it in the middle of the bay at 37.1129 N, 25.3783 E, just over 500 m distant.  This error, though very minor, is a telling clue because the point was digitized accurately.  The Barrington Atlas shows it in the middle of the bay (in order to print clear of other labelled places), in the very place Pleiades shows it.  But no one caught this error when going from printed map to digitized data point.

The harbor of Naxos with Pleiades 599630 in the middle of the bay just as depicted in the Barrington Atlas.

Pleiades 585129 described as "An ancient place, cited: BAtlas 59 B3 unnamed wall (Phalerikon Teichos, Leophoros Syngrou)" is shown deep in the Saronic Gulf at 37.875 N, 23.625 E which is about 8700 m from the nearest part of the Phaleron wall.    

The Piraeus.  End of the Phaleron wall at green. 
Pleiades 585129 in the middle of the bay (red).

I found these problems in just a few minutes because the Mycenaean Atlas Project is now offering a way to see the entire Pleiades data set.  I anticipate running across many more of these digitizing errors.

At one time I thought I saw a curious error signature in Pleiades' data.  There appeared to be two different values around which their errors tended to cluster which made the error distribution bimodal.  We even see that a little bit here.  Three of the five values I've noticed here are about 8500-8700 m. off.  This has to be a regularity.  I was puzzled by this at the time I first saw it and yet now I may have discovered the reason.  I suspect that Pleiades digitized their data from maps of different scales.  Errors would be greater on small-scale maps and smaller on large-scale maps.

What factors might be responsible for Pleiades' lack of interest in their own data?  There are several.

1. Size.  The Pleiades database (if that's the word I want) contains nearly forty thousand items.  The consensus among people who work in creating maps of antiquity is that it takes at least one hour per data point.  Often it takes five or six hours.  In six years of steady work on the Mycenaean Atlas I have mapped about eleven thousand sites, both modern and Bronze Age.  A man-year is 2200 hours.  Six man-years is 13200 hours.  So: 1.2 hours per site.  Applying the same figure of 1.2 hours to 38000 sites (the size of Pleiades' database) gives 45600 hours, or about 20.7 man-years, to validate the sites in the Pleiades DB.  Who has that kind of time?  It's clear that when the Pleiades managers were faced with this potential cost they decided that, because what they had was based on the Barrington Atlas, the raw digitized data was good enough.  Pleiades as a scholarly endeavor was never in the cards.

2. Pleiades appears not to have their data in a true database.  They provide downloads in several different formats: .csv, .kml, .json, .xml, .turtle, etc. but no .sql or anything that looks like true DB output.  I created a true relational DB from their data in about four days but I had to use their .csv version.   If I'm right in my opinion that there is no true relational DB behind Pleiades then it means that their basic representation of all this data is just incredibly long lists, probably in .json format.  If that's true it might  explain the instability of their Peripleo product.  There is a lot of talk on the Internet about .json  databases.  That's a contradiction.  .json formatted data is just a heap of data, not a database.  It has none of the advantages of a DB and all of the disadvantages of a verbose, over-inflated list, including long run times which get longer as the list gets longer.

3. Lack of interest.  I have read a number of the Pleiades papers over the years.  From these papers it is obvious that the Pleiades team's real interest is Computer Science.  Their primary focus is the netherworld of symbol-based Artificial Intelligence where computers are 'semantically aware' and new knowledge is always just one Improved Reasoner away.  It is a world of religious belief, of the faith that moves mountains and in which continued funding is the new version of everlasting life.  It's not that they don't talk much about toponymy - they don't talk about toponymy at all.

4.  Inability to easily see their data on a map.  It's not clear to me that the Pleiades team ever had decent tools for looking at their own data.
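The effort estimate in item 1 can be reproduced in a few lines. The inputs are the figures given there (2200 hours per man-year, six years, about 11000 sites mapped, 38000 Pleiades sites); the variable names are mine, and the result is only as good as the assumption that Pleiades sites would take as long to check as mine did.

```python
HOURS_PER_MAN_YEAR = 2200   # figure used in item 1
years_worked = 6            # years spent on the Mycenaean Atlas
sites_mapped = 11_000       # approximate number of sites mapped in that time
pleiades_sites = 38_000     # approximate size of the Pleiades data set

# Average effort per site, from total hours worked divided by sites mapped.
hours_per_site = HOURS_PER_MAN_YEAR * years_worked / sites_mapped

# Effort to validate every Pleiades site at the same rate, in man-years.
total_hours = pleiades_sites * hours_per_site
man_years = total_hours / HOURS_PER_MAN_YEAR

print(f"{hours_per_site:.1f} hours per site")
print(f"{man_years:.1f} man-years to validate {pleiades_sites} sites")
```

At that rate the validation works out to a little under 21 man-years.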

Soon, perhaps next time, I'm going to put the Pleiades enterprise into perspective and try to explain what they're really doing.