ESRI Shapefile is a file format for storing geospatial vector data. It has been around since the early 1990s and is still the most commonly used vector data exchange format.
While Shapefiles have enabled many successful activities over the years, they also have a number of limitations that complicate software development and reduce efficiency.
We, members of the geospatial IT industry, believe that it is time to stop using Shapefiles as the primary vector data exchange format and to replace them with a format that takes advantage of the huge advances that have been made since Shapefile was introduced.
Shapefile does a lot of things right. Here are some reasons why Shapefile is so heavily used:
Why is Shapefile so bad? Here are several reasons why the Shapefile is a bad format and you should avoid its usage:
By default, there is no definition of the coordinate reference system used. One can be supplied in a .prj file, but first, this is not a standard part of the specification, and second, there are still some issues; see projection issues and multifile format further below.
The Shapefile format uses at least 3 files (*.shp, *.dbf, *.shx). Users cannot share just one file; you must send them all. Users typically zip all the files into one archive and unzip them on the other end of the distribution chain, but this is cumbersome and error-prone.
In addition, other geospatial software packages routinely add their own extensions to try to overcome Shapefile limitations. Custom additions are not supported by other tools and limit interoperability.
NOTE: 3rd December is considered the International Shapefile day because, thanks to its modular, extensible architecture, a Shapefile can have 12+ sidecar files, 3 of which are mandatory.
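The zip-and-send routine described above can be sketched in a few lines of Python. This is only an illustration: the function name bundle_shapefile and the short list of optional sidecars (.prj, .cpg) are assumptions made for the example, not part of any specification, and real datasets may carry other sidecar files.

```python
import zipfile
from pathlib import Path

# The three mandatory components named above, plus two common optional
# sidecars. The optional list is illustrative, not exhaustive.
MANDATORY = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj", ".cpg")


def bundle_shapefile(shp_path, zip_path):
    """Zip a Shapefile together with its sidecar files.

    Raises FileNotFoundError if a mandatory component is missing.
    Returns the list of file names written to the archive.
    """
    shp_path = Path(shp_path)
    missing = [ext for ext in MANDATORY
               if not shp_path.with_suffix(ext).exists()]
    if missing:
        raise FileNotFoundError(f"missing components: {missing}")

    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for ext in MANDATORY + OPTIONAL:
            part = shp_path.with_suffix(ext)
            if part.exists():
                # Store only the bare file name, not the full directory path.
                archive.write(part, arcname=part.name)
                written.append(part.name)
    return written
```

The check for missing components is the point of the sketch: with a single-file format there would be nothing to forget.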
Attribute names are limited to 10 characters max. Longer names are usually automatically shortened. This leads to abbreviated and/or cryptic attribute names that are unintuitive to the recipient of the data.
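A small Python sketch shows why the 10-character limit is more than a cosmetic problem. The truncation here is deliberately naive and is not the algorithm of any particular tool (real writers differ in how they shorten and de-duplicate names); the field names are made up for the example.

```python
def truncate_field_names(names, limit=10):
    """Cut each name to `limit` characters.

    A naive stand-in for what a Shapefile writer might do; actual tools
    use their own shortening and de-duplication rules.
    """
    return [name[:limit] for name in names]


def find_collisions(names, limit=10):
    """Map each truncated name to the original names that collapse onto it,
    keeping only entries where more than one original name collides."""
    groups = {}
    for name in names:
        groups.setdefault(name[:limit], []).append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical attribute names for a census layer.
fields = ["population_2010", "population_2020", "area_km2"]
print(truncate_field_names(fields))
print(find_collisions(fields))
```

Both population columns collapse onto the same short name, so a writer has to invent something like a numeric suffix, and the recipient has to guess what it means.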
There can be only 255 attribute fields in the database file. For some applications this is limiting, especially in combination with the flat table structure.
Float, integer, date and character string data types are supported. Floating point numbers can be stored as text, but there is no support for big integers (so the format is not usable if you have data with big integer identifiers, such as cadastral maps), and text is limited to only 254 characters.
There is no support for more advanced data fields such as blobs, images or arrays.
There is no way to specify the character set used in the database. Many applications still use the old Windows-* or ISO-* data encodings, while nowadays UTF-8 is increasingly the norm. Still, there is no way to specify this in the file header.
The support for Unicode characters is also very limited.
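To see why an undeclared character set matters, consider the same attribute value written under two encodings. The sketch below uses only Python's built-in codecs; the choice of cp1250 (a Windows Central European code page) and the sample town name are assumptions made for illustration.

```python
# Czech town name "Příbram", written with escapes so the source stays ASCII.
name = "P\u0159\u00edbram"

utf8_bytes = name.encode("utf-8")
cp1250_bytes = name.encode("cp1250")  # Windows Central European code page

# The same text produces different byte sequences under each encoding.
print(len(utf8_bytes), len(cp1250_bytes))

# Reading UTF-8 bytes as cp1250 succeeds silently but yields wrong characters.
garbled = utf8_bytes.decode("cp1250")
print(garbled == name)

# Reading cp1250 bytes as UTF-8 fails outright.
try:
    cp1250_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

Without a declared encoding, a reader has to guess, and the silent case is the more dangerous of the two because nothing signals that the text is wrong.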
The size of both .shp and .dbf component files cannot exceed 2 GB. The GDAL Shapefile driver overcomes this limit, but with caveats:
The Shapefile format explicitly uses 32bit offsets and so cannot go over 8GB (it actually uses 32bit offsets to 16bit words), but the OGR shapefile implementation has a limitation of 4GB.
For compatibility with other software implementations, it is not recommended to use a file size over 2GB for both .SHP and .DBF files.
So 4GB is all you can have in a single Shapefile. This sounds like enough, but it is not for all cases.
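The figures above follow from simple arithmetic on 32-bit offsets. The text only spells out the 8GB derivation (32-bit offsets to 16-bit words); reading the 4GB and 2GB figures as signed 32-bit limits is an assumption made here for illustration, not something the text confirms.

```python
WORD_BYTES = 2   # offsets are counted in 16-bit words, i.e. 2 bytes each
GIB = 2**30

# A 32-bit offset counted in 16-bit words can address 2**32 words.
unsigned_limit = 2**32 * WORD_BYTES

# Assumption: treating the same 32-bit value as signed halves the range.
signed_limit = 2**31 * WORD_BYTES

# Assumption: software that keeps a signed 32-bit *byte* count stops here.
byte_offset_limit = 2**31

print(unsigned_limit // GIB, signed_limit // GIB, byte_offset_limit // GIB)
```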
Shapefile is a simple-feature format. There is no way to store more complex geometry relationships.
Each file can be only one of the supported geometry formats (Point, Line, Polygon and others). Mixed geometry features are not possible.
The data structure is limited to flat tables with no hierarchies, relations or tree structure.
Shapefile can't store material definitions nor textures (images with texture coordinates). 3D models are stored as a triangle or polygon soup, with no watertight models or parametric geometries being supported.
By default, a Shapefile contains no information about the coordinate reference system at all. Some software packages do accept .prj files, which may contain a CRS description.
It uses Esri WKT definitions, which are often incompatible with standard definitions in EPSG or other sources regarding aspects such as axis order or unit definitions. Furthermore, they often miss parameters required for reprojection ("Missing Bursa Wolf Parameters", anyone?)
Line and polygon geometry type, single or multipart, cannot be reliably determined at the layer level; it must be determined at the individual feature level. This leads to inconsistency during automatic data processing: you cannot rely on the input geometry type and must test each feature to see whether it is a single geometry or multiple geometries.
There is no way to mark no data in a field of the attribute table. You cannot distinguish between zero and no data in numerical fields.
Do you know about more limits or do you want to extend existing ones? Please do so via pull-request or comment in the repository.
What are the alternatives to the Shapefile format? To be honest, no alternative format has overthrown the Shapefile hegemony yet. Some formats nearly took over (KML, GML, GeoJSON), but their usage was limited to relatively narrow use cases only.
Although there are more than 80 vector data formats in use out there, only a few can be considered candidates for Shapefile replacement. Please note that we take only open (preferably community) formats into account.
List of some Shapefile alternatives
OGC GeoPackage is one of the most promising formats, designed for today's modern applications. GeoPackage is published as a standard by the Open Geospatial Consortium.
GeoPackage is an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information.
The GeoPackage Encoding Standard describes a set of conventions for storing the following within an SQLite database:
There are several published extensions for GeoPackage which make this format even more powerful.
GeoPackage is now (2017) supported in most GIS software packages.
One downside to GeoPackage is that the underlying SQLite database is a complex binary format that is not suitable for streaming. It either must be written to the local file system or accessed through an intermediary service.
We recommend GeoPackage as a Shapefile replacement for scenarios where the recipient will want to query or edit the data locally.
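Because a GeoPackage is an SQLite database, its contents can be inspected with nothing more than a standard SQLite client. The sketch below uses Python's sqlite3 module to read the gpkg_contents table, which the GeoPackage standard uses to list what a file holds. The function name and the file name in the usage note are hypothetical.

```python
import sqlite3


def list_geopackage_layers(path):
    """Return (table_name, data_type) pairs from the gpkg_contents table."""
    connection = sqlite3.connect(str(path))
    try:
        rows = connection.execute(
            "SELECT table_name, data_type FROM gpkg_contents "
            "ORDER BY table_name"
        ).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        connection.close()
    return rows
```

Calling list_geopackage_layers("roads.gpkg") on a real file would return one row per layer. Reading the geometries themselves takes more work, since they are stored as binary blobs, and is better left to a GIS library.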
FlatGeobuf is a new format, designed for performance and simplicity.
FlatGeobuf is an open, standards-based, platform-independent, portable, self-describing, performant and compact format for transferring geospatial information.
We recommend FlatGeobuf as a Shapefile replacement for scenarios where performance is critical and for system-to-system integrations. Because of its streaming capabilities, it is also suitable as an alternative WFS output format and is available as an official extension to GeoServer.
GeoJSON is a community format based on the popular JSON data exchange format.
"GeoJSON isn't a shapefile replacement."
-- Sean Gillies
GeoJSON is a very simple, human-readable, text-based format. Although it is technically possible to use it with other coordinate reference systems, the specification states clearly that WGS84 is the only system that should be used. It can handle complex vector data features and build complex hierarchical data models.
Since GeoJSON is a JSON encoding it is very easy to parse. It also supports streaming (features are dealt with as they come in without waiting for the whole file to load).
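As a minimal illustration of how little is needed to read GeoJSON, the sketch below parses a small FeatureCollection with Python's standard json module. The sample feature is made up, and the example loads the whole document at once rather than streaming it.

```python
import json

# A tiny, hypothetical FeatureCollection with one point feature.
document = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [14.42, 50.09]},
      "properties": {"name": "Prague", "population": null}
    }
  ]
}
"""

collection = json.loads(document)
for feature in collection["features"]:
    # GeoJSON positions are ordered longitude, latitude.
    lon, lat = feature["geometry"]["coordinates"]
    print(feature["properties"]["name"], lon, lat)
```

Note that the population property is null, which the parser maps to None: unlike a Shapefile attribute table, JSON can tell a missing value apart from zero.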
The problem with GeoJSON is that not all geometries can be represented and advanced coordinate reference systems are not well supported.
We recommend GeoJSON as a Shapefile replacement for data interchange particularly for web services. For datasets with geometries or coordinate reference systems not representable in GeoJSON, GML may be suitable.
GML is another OGC Standard.
GML was picked as the main vector data distribution format for the European INSPIRE initiative. It's a very complex format, and its direct usage in GIS software is limited. Its main use is as a data exchange format that needs to be ingested into the user's system (e.g. into a database) to be fully usable.
GML is currently often used for open data datasets, since it is technology-neutral and a supported OGC Standard.
A major downside to GML is that it is an insanely complex standard. Few software packages support the entire standard and support for individual parts of the standard varies widely.
We believe that GML is a candidate for Shapefile replacement for data interchange in situations where data is too complex to be represented by GeoJSON. However, for the vast majority of datasets GML is overkill.
SpatiaLite is a popular file-based database for data storage.
SpatiaLite is an open source library intended to extend the SQLite core to support fully fledged Spatial SQL capabilities. SQLite is intrinsically simple and lightweight:
Support for SpatiaLite is relatively limited, and most software that supports SpatiaLite also supports GeoPackage. They build on top of the same underlying technology, SQLite. SpatiaLite lacks the support for extensions or raster data present in GeoPackage. While these are not necessarily must-have features, they may be useful. Like GeoPackage, it is unsuitable for streaming.
Since SpatiaLite offers no clear advantages over GeoPackage at this time, it should only be considered as a Shapefile replacement in niche scenarios.
Some people tend to use comma separated files for storing geospatial data.
Among non-geospatial people, CSV is very popular, but for most geospatial applications it is not an ideal format.
There are at least two reasons not to use CSV as a Shapefile replacement: it isn't standardized (there are many dialects out there), and support for non-point geospatial data is complicated.
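The point-data case is the one CSV handles tolerably, as this Python sketch suggests. The column names lon and lat are an assumption: nothing in a CSV file says which columns hold coordinates, which is exactly the standardization problem noted above. Lines and polygons have no comparably simple column layout.

```python
import csv
import io

# Hypothetical sample; the "lon" and "lat" column names are an assumption.
sample = io.StringIO(
    "name,lon,lat\n"
    "Prague,14.42,50.09\n"
    "Brno,16.61,49.19\n"
)


def csv_points_to_features(handle, x_field="lon", y_field="lat"):
    """Turn CSV rows into GeoJSON-style Point features."""
    features = []
    for row in csv.DictReader(handle):
        # Remove the coordinate columns; what remains becomes the properties.
        x = float(row.pop(x_field))
        y = float(row.pop(y_field))
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [x, y]},
            "properties": row,
        })
    return features


features = csv_points_to_features(sample)
print(features[0])
```

The caller has to supply the coordinate column names, and every remaining value arrives as a string, so types must be guessed as well.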
OGC KML was a popular vector data format due to the popularity of Google Earth.
KML was originally devised as the exchange format for a software package called Keyhole. When Google purchased Keyhole and released it as Google Earth, KML gained in popularity. However, as the geospatial community hit the limits of both Google Earth and KML, KML's popularity has waned. Since it is XML based, it is not efficient for storing larger datasets. It combines cartography along with the data geometry in one file, which is problematic when the data has the potential to be used in multiple ways. Since it officially supports only the WGS-84 coordinate reference system, it is not suitable for a number of applications.
At its most basic level, an ArcGIS geodatabase is a collection of geographic datasets of various types held in a common file system folder, a Microsoft Access database, or a multiuser relational DBMS (such as Oracle, Microsoft SQL Server, PostgreSQL, Informix, or IBM DB2).
GeoDatabase is very often used in the ArcGIS environment as the main exchange data format. Its features are very complex and advanced.
On the other hand, since it is a proprietary closed format, implementations outside the environment of ESRI products are extremely limited. It is only a candidate for replacing Shapefiles in an environment centered on ArcGIS.
Last modification: 2017-10-08
Initially created by: Jachym Cepicky, OpenGeoLabs s.r.o.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
Contribute: On GitHub