Thursday, December 13, 2007

Summary of Chapter 14 - Final

Chapter 14 – Data Standards and Quality

I have 25 years of computer programming experience, so I am quite familiar with the issue of data quality. Even more applicable is my extensive experience with “interchanging data” between systems. This arises when old systems are replaced with new systems, or when off-the-shelf software packages are integrated with an existing system.
The author introduces the issue on page 468 under the topic “format standards”. It applies to attribute data more than to spatial data. However, I don’t think he specifically addresses the problem that I have encountered.

An example might be when a city changes its software package and the new system allows 10 values for “types of light poles”, whereas the old system only allowed 6. When the new system is installed, the old data must be mapped to the new system, but there might be some confusion as to how to correctly map the old values to the new ones. And once the new system is in use, the workers might continue to enter the old values and introduce inconsistency into the data.
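The light pole example can be sketched in a few lines of Python. This is only an illustration: the code values, the `OLD_TO_NEW` table and the function names are all made up for the example and do not come from the book or from any real system. The idea is that unambiguous old values map directly, ambiguous ones are flagged for a person to review rather than guessed, and new entries are checked against the new list so old values cannot creep back in.

```python
# Hypothetical code tables: 6 old values and 10 new values, invented for illustration.
NEW_POLE_TYPES = {
    "WOOD_STANDARD", "WOOD_DECORATIVE", "STEEL_GALVANIZED", "STEEL_PAINTED",
    "ALUMINUM", "CONCRETE", "FIBERGLASS", "CAST_IRON", "COMPOSITE", "OTHER",
}

# None marks an old value that splits into several new values and needs manual review.
OLD_TO_NEW = {
    "WOOD": None,        # could be WOOD_STANDARD or WOOD_DECORATIVE
    "STEEL": None,       # could be STEEL_GALVANIZED or STEEL_PAINTED
    "ALUM": "ALUMINUM",
    "CONC": "CONCRETE",
    "FIBER": "FIBERGLASS",
    "OTHER": "OTHER",
}


def migrate_pole_type(old_value):
    """Return the new code for an old code, or None if a person must decide."""
    key = old_value.strip().upper()
    if key not in OLD_TO_NEW:
        raise ValueError(f"unknown old pole type: {old_value!r}")
    return OLD_TO_NEW[key]


def is_valid_new_value(value):
    """Reject anything that is not in the new list, including leftover old codes."""
    return value in NEW_POLE_TYPES


if __name__ == "__main__":
    for old in ["ALUM", "wood", "CONC"]:
        new = migrate_pole_type(old)
        print(old, "->", new if new is not None else "NEEDS MANUAL REVIEW")
    print("Is 'WOOD' accepted in the new system?", is_valid_new_value("WOOD"))
```

Flagging the ambiguous values instead of picking a default is a design choice; a default would be faster but would quietly bake wrong values into the new system.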

This is closely related to the issue of data standards because strict definitions of every field of data are very important, and training of the personnel using the system and entering data is critical to data integrity. The data is only as good as the people who enter it and maintain it.

I found the whole topic of measuring data quality very interesting because I have never worked in an environment where the accuracy of the data could actually be tested. For example, I have worked in hospitals where the patient treatment is coded by trained personnel. However, there was no easy way to verify that the recorded treatment was accurate, because the physician’s notes might be wrong, the physician’s and the nurses’ memories might be in error, and the patient may not know exactly what was done to him. This contrasts sharply with spatial data, where the actual real-world position of an object can be tested.

It was good to see that the federal government has established standards for transferring spatial data (SDTS, Spatial Data Transfer Standard) and for measuring spatial data accuracy (NSSDA, National Standard for Spatial Data Accuracy). However, the book points out that measuring accuracy is expensive, because test points must be established and the data checked against them. Usually, government bodies would rather go out and collect more data than go back and check their existing data. Also, the NSSDA only applies to point accuracy, not to line features or polygons.

There are four main aspects of spatial data accuracy: positional accuracy, attribute accuracy, logical consistency and completeness. Actually, now that I’m writing this, I think the author might categorize my opening comments as “logical consistency” or “attribute” errors. The book uses an aerial photo to clearly describe the four issues. The physical position of the houses in the photo might be inaccurate (positional), or houses might be labeled as garages (attribute), or items might be illogically stored in the data, for example, light poles in the middle of a street (logical consistency), or, finally, a building in the photo might be completely left out of the data (completeness). Missing data may be caused by the minimum mapping unit being too big, since features smaller than the minimum mapping unit will be left out. Or perhaps the data is not current, and features have been added or removed on the ground that are not reflected in it.

The book establishes a distinction between accuracy and precision. Accuracy indicates how close a data item is to its true location. Precision indicates how consistent the data items are with one another; precise data may still be inaccurate. This can happen when a bias is introduced by something like equipment problems, so that every measurement is off in the same way. The Federal Geographic Data Committee (FGDC) has set a standard method for measuring data accuracy, the NSSDA. There are five steps to this process: 1. select test points, 2. define an independent control data set, 3. collect measurements from both sources, 4. calculate the positional accuracy statistic, 5. report the accuracy statistic in a standardized form included in the metadata.

To measure the accuracy of point data, the distance formula (the Pythagorean theorem) is used to measure the distance between the true location and the data location. The length of this distance is the positional error for that data point.
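Here is a minimal sketch of that calculation in Python. The function names and the sample coordinates are invented for illustration. The 1.7308 multiplier is the factor the NSSDA uses to turn a horizontal root-mean-square error into a 95% confidence figure, and it assumes the errors in x and y are roughly equal; the standard also asks for at least 20 test points, far more than the toy data here.

```python
import math


def positional_errors(data_points, control_points):
    """Distance between each data point and its matching control (true) point."""
    return [
        math.hypot(xd - xc, yd - yc)
        for (xd, yd), (xc, yc) in zip(data_points, control_points)
    ]


def rmse(errors):
    """Root-mean-square of the per-point distance errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def nssda_horizontal_accuracy(errors):
    """95% horizontal accuracy figure, assuming similar error in x and y."""
    return 1.7308 * rmse(errors)


if __name__ == "__main__":
    # Made-up coordinates in the same units (say, meters).
    data = [(3.0, 4.0), (10.0, 10.0)]
    control = [(0.0, 0.0), (10.0, 10.0)]
    errors = positional_errors(data, control)
    print("per-point errors:", errors)
    print("RMSE:", round(rmse(errors), 3))
    print("NSSDA 95% accuracy:", round(nssda_horizontal_accuracy(errors), 3))
```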

As mentioned previously, there is no established standard for the accuracy of linear features, only for data points. One common approach is to define an epsilon band: a zone of a fixed width (epsilon) on either side of the line, which serves as a margin of error for the line's position.
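One way to picture an epsilon band is as a test of how much of a digitized line falls within epsilon of the true line. The Python sketch below is a simplification under stated assumptions: it checks only the vertices of the digitized line, not every point along it, and the function names and sample coordinates are invented for the example.

```python
import math


def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # a and b are the same point
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_line(p, line):
    """Shortest distance from p to a line given as a list of vertices."""
    return min(
        point_to_segment_distance(p, line[i], line[i + 1])
        for i in range(len(line) - 1)
    )


def fraction_within_band(digitized_vertices, true_line, epsilon):
    """Share of digitized vertices that lie inside the epsilon band."""
    inside = sum(
        1 for v in digitized_vertices if distance_to_line(v, true_line) <= epsilon
    )
    return inside / len(digitized_vertices)


if __name__ == "__main__":
    true_line = [(0.0, 0.0), (10.0, 0.0)]
    digitized = [(0.0, 0.5), (5.0, 1.5), (10.0, -0.2), (12.0, 0.0)]
    print(fraction_within_band(digitized, true_line, epsilon=1.0))
```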

Another interesting point mentioned is that format standards have been lacking and vendor formats have filled the void. Specifically, ESRI shapefiles and interchange files were a common format for years. However, the lack of official standards left room for vendors to change things at will and introduce problems for the whole field of GIS. The newly emerging standards should tighten this up and hold everyone to a common, interchangeable data model.

Another interesting point is that the rapid blossoming of GIS and spatial data in just the past twenty years has caused an increase in the study of data accuracy. Academic researchers have been studying spatial data quality for a long time, but it is only recently that the work has intensified. Some of the measurements used to describe data accuracy are: the average distance error, the total area that is classified incorrectly, the biggest distance error, and the percent of data points that are in error.
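Three of those measurements are simple to compute once the per-point distance errors are known. The Python sketch below is only an illustration: the function name, the sample errors and the idea of a single `tolerance` that decides when a point counts as "in error" are assumptions made for the example. The misclassified-area measure is left out because it needs polygon data.

```python
def summarize_errors(errors, tolerance):
    """Average error, biggest error, and percent of points beyond the tolerance."""
    n = len(errors)
    over = sum(1 for e in errors if e > tolerance)
    return {
        "mean_error": sum(errors) / n,
        "max_error": max(errors),
        "percent_in_error": 100.0 * over / n,
    }


if __name__ == "__main__":
    # Made-up distance errors, all in the same units.
    print(summarize_errors([1.0, 2.0, 3.0, 6.0], tolerance=2.5))
```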

This is an important topic. Now if only the media and news organizations would try to establish a similar standard for the accuracy of their reporting. Then we could judge which newspapers and TV news broadcasts are reliable!!!
