Distributed Versioning for GeoSpatial Data (Part 3)
Distributed Versioning, Potential Development
This is the final paper in a series of three describing OpenGeo’s vision for distributed version control. This paper describes our proposed work path toward a fully realized infrastructure of distributed versioning tools for geospatial. The goal is for this software development to be grounded in real-world use. Ideally, tools will be released as early and as often as possible, and all will remain open source to enable distributed innovation. OpenGeo is fully aware that the only way to develop outstanding software tools is to iterate continually based on feedback from expert and less experienced users; many items listed here will be revisited repeatedly, with core enhancements made as we learn more.
The items are listed in a potential development sequence, but this order is only intended as a suggestion. Some items could be developed in parallel and some could move earlier in the queue for certain use cases. For background on what has already been accomplished, please see the previous paper in this series: Distributed Versioning Implementation.
1. Basic Versioning JavaScript Interface
Currently we have a working distributed versioning backend, but no front end tools that take advantage of it (aside from some modifications to uDig to enter commit messages). A basic versioning front end should enable editing commit messages, a view of the history of edits, and ideally some sort of rollback. In short, this front end would be an update of the functionality prototyped and demonstrated several years ago.
The interface will be updated to use GXP components, so it can be easily incorporated into GeoExplorer, GeoNode, and other custom applications. Because it will rely on the versioning backend, the quickest way to implement this interface would be to use the existing protocols: either the GeoSynchronization Service or WFS 2.0 versioning constructs. Both platforms can record some history, and commit messages can be created by overloading the “handle” attribute of WFS. Neither platform offers “rollback” as an operation, but this functionality could be added as the start of a REST API for distributed versioning or even left off the first round of development.
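As a rough illustration of overloading “handle” for commit messages, the sketch below builds a WFS-T Update request whose Transaction carries the message. It uses WFS 1.1.0 namespaces for brevity rather than the WFS 2.0 constructs discussed above, and the function name, layer name, and property name are hypothetical.

```python
import xml.etree.ElementTree as ET

WFS_NS = "http://www.opengis.net/wfs"
OGC_NS = "http://www.opengis.net/ogc"


def build_update_with_message(type_name, feature_id, prop_name, new_value, message):
    """Build a WFS-T 1.1.0 Update whose 'handle' attribute carries a commit message."""
    ET.register_namespace("wfs", WFS_NS)
    ET.register_namespace("ogc", OGC_NS)
    # The commit message rides on the standard 'handle' attribute (assumed convention).
    tx = ET.Element(
        f"{{{WFS_NS}}}Transaction",
        {"service": "WFS", "version": "1.1.0", "handle": message},
    )
    update = ET.SubElement(tx, f"{{{WFS_NS}}}Update", {"typeName": type_name})
    prop = ET.SubElement(update, f"{{{WFS_NS}}}Property")
    ET.SubElement(prop, f"{{{WFS_NS}}}Name").text = prop_name
    ET.SubElement(prop, f"{{{WFS_NS}}}Value").text = new_value
    # Restrict the update to a single feature by its ID.
    flt = ET.SubElement(update, f"{{{OGC_NS}}}Filter")
    ET.SubElement(flt, f"{{{OGC_NS}}}FeatureId", {"fid": feature_id})
    return ET.tostring(tx, encoding="unicode")


request_xml = build_update_with_message(
    "topp:roads", "roads.42", "name", "Main Street", "Corrected street name"
)
```

A client would POST the resulting XML to the WFS endpoint; how the server maps the handle to a stored commit message is left to the backend.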
2. Full Linear Versioning JavaScript Interface and API
A mature versioning interface should offer more sophistication with respect to “diffs” and “rollbacks,” not only displaying a history but also enabling comparison between two versions. This comparison should be possible for both spatial and non-spatial attributes, and ideally should be visual and intuitive. Developing such a tool will require a great deal of user experience design and research. Unfortunately, there are limited examples of intuitive visual interfaces in the geospatial world. In the first round, we’ll be aiming for something functional.
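To make the idea of comparing two versions concrete, here is a minimal sketch of a feature-level diff. It assumes each version is a plain dictionary keyed by stable feature ID, with geometry held as a WKT string; the structure and function name are illustrative and not part of any existing API. Geometry is compared textually here, whereas a real tool would compare geometries spatially.

```python
def diff_versions(old, new):
    """Compare two layer snapshots keyed by stable feature ID.

    Each snapshot maps feature ID -> {"geometry": <WKT string>, "attributes": {...}}.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for fid in sorted(set(old) & set(new)):
        before, after = old[fid], new[fid]
        keys = set(before["attributes"]) | set(after["attributes"])
        # Record (old value, new value) for every attribute that differs.
        attr_changes = {
            key: (before["attributes"].get(key), after["attributes"].get(key))
            for key in sorted(keys)
            if before["attributes"].get(key) != after["attributes"].get(key)
        }
        geometry_changed = before["geometry"] != after["geometry"]
        if attr_changes or geometry_changed:
            changed[fid] = {
                "geometry_changed": geometry_changed,
                "attributes": attr_changes,
            }
    return {"added": added, "removed": removed, "changed": changed}
```

A visual client could render the "changed" entries as before/after overlays and list the attribute pairs alongside them.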
This UI will also drive an open API. It’s our experience that the best way to develop a useful API is to start with the needs of a concrete client, rather than attempting to design in a vacuum. This API will likely evolve when other clients start using it, but a well planned, web-friendly REST API designed for a JavaScript client can form the basis for a developing interface. Meanwhile the system will continue to implement any other relevant open standards.
This development item should include some UI/UX work, with a goal of making an intuitive wiki-like interface for mapping. There are some decent precedents to draw upon, such as OpenStreetMap and Google's MapMaker, but we hope to make that level of versioning possible with any geospatial layer. We are also looking for more sophistication with respect to “diffs” and a means of making the history of edits accessible to all. However, at this point we are not yet looking into branching and merging. We are still only considering a single, linear, non-distributed versioning environment.
3. GeoNode Integration
Once a full linear versioning JavaScript interface has been created, the next step would be to integrate it into GeoNode. GeoNode offers a well-developed user interface for groups and permissions, enabling users to experiment with their data without having to build custom editing environments. With the first development iteration of the GeoNode system, users could choose to enable versioning, and as development progresses we may consider making versioning a default for all uploaded data in GeoNode. Another integration point would be the notification and activity stream functionality in GeoNode. This would enable user edits to be visible in real time, and notifications to be requested when certain sections are edited, for example when someone else is editing a part of a layer where you have particular expertise. This GeoNode integration should make full linear versioning available to a much wider audience, which can help drive priorities for the interface's next directions.
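The area-based notifications described above could work along the following lines. This is only a sketch: it assumes each user subscribes with a bounding box and that each edit reports the bounding box it touched. The function names and data shapes are hypothetical and do not reflect GeoNode's actual notification API.

```python
def bbox_intersects(a, b):
    """Return True when two boxes overlap; boxes are (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def users_to_notify(subscriptions, edit_bbox):
    """List the users whose area of interest overlaps the edited area."""
    return sorted(
        user for user, area in subscriptions.items() if bbox_intersects(area, edit_bbox)
    )
```

A production system would likely test against the edited geometry itself rather than its bounding box, but the box test is a cheap first filter.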
4. Front-end Research
The biggest development item is the user interface and user experience around geospatial versioning. This item can be broken up across any of the topics that involve user interface. Regardless of how the development tasks are subdivided, the key to their success will be the expectation that they be built not just once, but many times over through continual user testing and iterative improvement. This feedback loop should include full UX testing and review, observing typical users on the software to identify areas for improvement.
The standard UX testing, however, will not be sufficient for our purposes, as we are not just aiming to make existing concepts usable but to invent new concepts to suit the environment of geospatial. What is a “diff” and a “merge” in the context of geospatial? How does a “clone” make sense with respect to geospatial mapping? With source code, the primary users are software developers, who are accustomed to technical tools and, though they might appreciate attractive GUIs, are content with Git’s command-line nature. With geospatial, though, the typical users are not as advanced, and may need a more visual interface to make sense of the information.
We will need to find a means of breaking up the interface development into discrete items, as it will be an essential part of any funding round. This distributed versioning effort will not succeed without easy, intuitive user interfaces, because our target users are not just GIS experts but anyone who has ever wanted to modify a map to suit their needs.
5. Real-World Applications
There are a number of real-world applications that can benefit from tracking history and versions with basic diffs and rollbacks. These applications should be prioritized once the basic software functionality is in place to get the code quality up to production level and evolve the user interface with real users. The real-world applications should be sought from a variety of domains, with different types of workflows and users, pushing the core in a variety of directions. These directions should be handled in an agile manner rather than waiting to proceed until the improvements are fully integrated into the core. After several different real-world applications are completed, commonalities between the improvements can be evaluated and more generic functionality can be incorporated. Some of these applications will likely start as development items described here, such as staging areas and mobile syncing and merging, but may also lead to unanticipated innovations. It is essential in this research and development to tackle real-world problems early on to ensure that those without programming expertise can use the applications.
6. Basic Mobile Versioning
The first step toward mobile versioning is to focus on connected scenarios where the mobile device has an internet connection. This stage simply involves optimizing JavaScript components for mobile devices. Much of the code can be reused, but the widgets should be created in Sencha Touch or jQuery Mobile interfaces, with attention given to intuitive UI.
The next step is to tackle disconnected editing. In the basic version, this would be limited to tracking the changes made offline and committing them when online. Ideally, each change would be recorded individually, so that the sync back to the server creates each individual commit with a timestamp and commit message. This functionality will require some advancement of the distributed versioning API to handle a bulk upload.
In this basic disconnected scenario, we won't attempt to address merge conflicts on the mobile device. These conflicts will be placed in a rejection queue or resolved through a custom workflow in the main application.
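The basic disconnected workflow can be sketched as a queue of individually recorded changes that is replayed on reconnect, with conflicting edits diverted to a rejection queue. The class, its fields, and the version-number conflict check are assumptions made for illustration; the actual bulk-upload API is yet to be designed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class OfflineChange:
    feature_id: str
    base_version: int  # server version the edit was based on (assumed scheme)
    new_value: dict
    message: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def sync_changes(queue, server_versions):
    """Replay queued edits in timestamp order, one commit per change.

    server_versions maps feature ID -> current version number on the server.
    Edits based on a stale version are not merged; they go to a rejection queue.
    """
    committed, rejected = [], []
    for change in sorted(queue, key=lambda c: c.timestamp):
        current = server_versions.get(change.feature_id, 0)
        if change.base_version == current:
            server_versions[change.feature_id] = current + 1
            committed.append(change)
        else:
            rejected.append(change)
    return committed, rejected
```

Rejected changes keep their original timestamp and commit message, so a reviewer in the main application can still see what the field user intended.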
7. Basic Desktop Integration
Desktop GISs should also be able to interact with a versioning server. Primary targets should be ArcGIS desktop and whichever open source programs seem appropriate - uDig, QGIS and gvSIG, for example. The first priority is to make versioning work effectively without necessarily being visible. Some users may not have any awareness that they are editing using a program that tracks versions, but their edits should be tracked like any others. There are several different ways to achieve this default versioning:
- Ensure the versioning server can handle web editing protocols that the desktop tools already use - WFS-T for uDig/gvSIG/QGIS and GeoServices REST Specification for Esri.
- Write triggers on the databases (PostGIS, ArcSDE, Oracle) that the desktop tools directly edit to keep the repositories in sync.
- Create plugins that understand the advanced distributed versioning APIs and edit directly against them.
The last option requires the most work, but would best establish a base for additional functionality in desktop GISs to truly take advantage of the distributed versioning system. The simplest fix would be to enable users to enter a “commit message” when they make a change, but much more is possible, some of which is articulated in the “Advanced Desktop Integration” item below.
8. Conflict Resolution and Merging
To advance development on these topics (disconnected editing, synchronizing servers, branching) we must first address some fundamental questions of conflict resolution for geospatial. The linear versioning functionality is exclusively online, so all users are connected and major conflicts are unlikely. Changes are sequential and therefore less burdensome for those that receive the updates. Editing offline, in a disconnected environment, becomes a much larger problem, because many changes may need to be committed simultaneously when the user attempts to sync.
Workflow will need to be configured around conflict resolution to determine who has the authority and responsibility to resolve the conflict. Better still would be to improve merging at such a granular level that two users could edit the same river and, so long as their edits are on different vertices, have their changes merged without conflict.
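Vertex-level merging can be illustrated with a simple three-way merge. The sketch assumes both editors only moved existing vertices (none were added or removed), which is a strong simplification; the function name is ours and not part of the existing codebase.

```python
def merge_vertices(base, ours, theirs):
    """Three-way merge of a line's vertices, assuming no vertices were added or removed.

    Returns the merged vertex list and the indexes where both sides disagree.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles edits that move existing vertices")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:      # both sides agree, or neither changed the vertex
            merged.append(o)
        elif o == b:    # only 'theirs' moved this vertex
            merged.append(t)
        elif t == b:    # only 'ours' moved this vertex
            merged.append(o)
        else:           # both moved it to different places: keep base, flag it
            merged.append(b)
            conflicts.append(index)
    return merged, conflicts
```

Only the flagged indexes would need to go through the conflict-resolution workflow; everything else merges automatically.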
This process leads into “branching”, which is relatively easy with these data structures. The much bigger challenge is the user interface. Instead of trying to start with generic branching interfaces, where any user could create a new branch, we should focus on a few discrete workflows for disconnected and multi-user editing. One common and relatively simple workflow creates a staging area – an “approval branch” where edits are sent for review and, once approved, immediately merge into the main stream.
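The approval-branch workflow could be modeled roughly as follows. This is a deliberately small sketch with hypothetical names: edits wait in a pending area until a reviewer approves them, at which point they are appended to the main line. Conflict detection is omitted.

```python
class ApprovalBranch:
    """A staging area where edits wait for review before reaching the main line."""

    def __init__(self):
        self.main = []      # approved commits, oldest first
        self.pending = {}   # edit ID -> proposed commit awaiting review
        self._next_id = 1

    def submit(self, author, message, changes):
        """Queue an edit for review and return its ID."""
        edit_id = self._next_id
        self._next_id += 1
        self.pending[edit_id] = {"author": author, "message": message, "changes": changes}
        return edit_id

    def approve(self, edit_id, reviewer):
        """Move a pending edit onto the main line, recording who approved it."""
        commit = self.pending.pop(edit_id)
        commit["approved_by"] = reviewer
        self.main.append(commit)
        return commit

    def reject(self, edit_id):
        """Drop a pending edit without touching the main line."""
        return self.pending.pop(edit_id)
```

Keeping the reviewer on the commit record gives the history view an audit trail for every approved edit.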
Creating an effective approval branch process will require testing a number of real-world workflows to discover where problems arise when two users edit the same area. The core structures should make the functionality possible, but the more significant challenge is creating an intuitive interface that also makes it easy to resolve conflicts. These improvements will likely evolve the distributed versioning API to handle branching and merging in a flexible way, even if the user interface options are more constrained initially.
9. Synchronizing Servers and “Cloning”
The initial implementation in GeoServer included a GeoSynchronization Service datastore that demonstrated basic synchronization between two GeoServers. This demonstration relied on a master/slave configuration, with one server reflecting the other’s changes. Distributed version control should make possible peer-based replication between servers, with each one holding its own repository, but able to sync with others so that changes in one server could replicate out. Once conflict resolution is in place, this synchronization should enable some very interesting workflows, with an admin or group of editors taking responsibility for Quality Assurance on data flowing in from other nodes.
This setup would involve a versioned repository that would connect to the versioning API and could be configured for the workflow desired. In Git terms, this repository is created with a “clone” operation, creating a copy of the main repository in another repository. The clone could handle updates to the central datastore in a number of ways. It could: blindly sync as a slave; place changes into a staging branch to be reviewed and pulled in; establish a manual update process governed by an admin; or utilize a custom workflow. With any of these configurations, the core functionality could borrow the same “push” and “pull” paradigms from Git. The user interface, though, should go beyond these concepts to offer more discrete workflow options for selective syncing.
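The configurable handling of incoming updates might look something like the sketch below. The policy names and the Clone class are hypothetical; the point is only that a clone can apply changes immediately or hold them until someone releases all or a selected subset.

```python
from enum import Enum


class SyncPolicy(Enum):
    MIRROR = "mirror"    # blindly apply every upstream change
    STAGING = "staging"  # hold changes for review before pulling them in
    MANUAL = "manual"    # hold changes until an admin runs the update


class Clone:
    def __init__(self, policy):
        self.policy = policy
        self.main = []     # commits visible to users of this clone
        self.waiting = []  # commits received but not yet applied

    def receive(self, commits):
        """Accept commits from the upstream repository according to the policy."""
        if self.policy is SyncPolicy.MIRROR:
            self.main.extend(commits)
        else:
            self.waiting.extend(commits)

    def release(self, selected=None):
        """Apply waiting commits to main; 'selected' limits the release to chosen ones."""
        chosen = [c for c in self.waiting if selected is None or c in selected]
        self.waiting = [c for c in self.waiting if c not in chosen]
        self.main.extend(chosen)
        return chosen
```

In this sketch the staging and manual policies differ only in who calls release; a custom workflow would replace receive with its own rules.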
10. Additional Database Backends
The initial implementation has only been tested with PostGIS, but it is designed to work with any data format that can provide stable feature IDs and transactional atomicity. A relatively simple next step is to optimize the system for alternate database management systems such as Oracle, ArcSDE, and DB2.
Each of these implementations should incorporate a means of preventing inconsistent states from arising through direct edits that don't notify the repository. Transaction triggers that directly alert the versioned repository would enable it to stay up to date even when edits aren't made through GeoServer.
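As a self-contained illustration of the trigger approach, the sketch below uses SQLite (via Python's standard library) as a stand-in for PostGIS, Oracle, or ArcSDE. Triggers write every direct edit to a change log that the versioned repository could read to stay up to date. The table and trigger names are hypothetical, and real deployments would use each database's own trigger syntax.

```python
import sqlite3


def make_tracked_db():
    """Create an in-memory database whose 'roads' table logs every direct edit."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE roads (fid INTEGER PRIMARY KEY, name TEXT, geom_wkt TEXT);
        CREATE TABLE change_log (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            fid INTEGER NOT NULL,
            op TEXT NOT NULL
        );
        CREATE TRIGGER roads_ins AFTER INSERT ON roads
        BEGIN INSERT INTO change_log (fid, op) VALUES (NEW.fid, 'insert'); END;
        CREATE TRIGGER roads_upd AFTER UPDATE ON roads
        BEGIN INSERT INTO change_log (fid, op) VALUES (NEW.fid, 'update'); END;
        CREATE TRIGGER roads_del AFTER DELETE ON roads
        BEGIN INSERT INTO change_log (fid, op) VALUES (OLD.fid, 'delete'); END;
        """
    )
    return conn


def unseen_changes(conn, last_seq=0):
    """Return (seq, fid, op) rows the repository has not processed yet."""
    return conn.execute(
        "SELECT seq, fid, op FROM change_log WHERE seq > ? ORDER BY seq", (last_seq,)
    ).fetchall()
```

Here the repository polls the log; a trigger that notifies the repository directly, as described above, would remove the polling delay.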
Once integrated with these systems, the next step would be to take advantage of some of their versioning capabilities, such as Oracle's Workspace Manager or ArcSDE's versioning information. (See the section on the system’s “backends” for more information on this integration.) Essentially, the system operates independently of the backend, but could take advantage of each backend’s versioning and assist if inconsistent states arise. A more version-aware backend only helps, since the system does not compete with the backend. If an organization is already using advanced capabilities, it can add the proposed distributed version control on top for further functionality.
11. Versioning in PostGIS
Along the same lines as ArcSDE Versioning or Oracle Workspace Manager, PostGIS can be improved for better native versioning capabilities. Some work has been completed on pgVersion, but it is rudimentary and not adequate to meet the basic versioning requirements of WFS 2.0 and GSS. With improvements, though, PostGIS could be a high quality non-distributed versioning store, to which distributed version control can add distributed awareness.
12. Scalability
While the core distributed versioning concepts have been tested, making them scalable is another matter. The best way to achieve scalability is with real-world scenarios, profiling the code and testing, iteratively improving the weak points. While these tests may reveal that more fundamental architecture changes are warranted, such restructuring should be undertaken based on tested feedback, not premature optimization.
There are several dimensions to scalability with distributed version control. These include size of data, length of commit history, concurrent access, and total size of the repository. The minimum goal for OpenGeo will be importing the full OpenStreetMap database and history into a single repository. OpenStreetMap is currently hundreds of gigabytes, with numerous historical commits. This repository should be tested not only with thousands of concurrent access points and edits but also as part of a larger repository containing other datasets.
To achieve appropriate scalability, a public service should be created to attract a wide range of users, permitting the software to continually evolve based on their use and feedback. This real-world model has enabled the OpenStreetMap stack to become quite solid through its ongoing accommodation of increasing numbers of users. It should be noted that the goal is not to compete with OpenStreetMap, whose centralized versioning system makes sense for its use case; we, by contrast, are aiming toward full distributed versioning capability. The OpenStreetMap data offers an ideal place to test this functionality. Moreover, we also want to help enable decentralized versioning workflows around OpenStreetMap, so that organizations that rely on OpenStreetMap can use these tools and contribute feedback that leads to a system compatible with their existing workflows.
13. Alternative Backend Research and Development
While our current approach seems promising, we recognize that it is still in the early stages of development, and we remain open to alternative implementation approaches. There is much experimentation to be done within the codebase: different binary storage formats, key-value backends, and architecture possibilities, to name a few areas. We also would not treat a new area of development as complete until we've had far more real-world testing. We are open to rearchitecting after our initial findings.
There are other backend implementations that we’re interested in investigating, such as direct implementation of Git or CouchDB. Because their fundamental principles are akin to those of the proposed system, it should still be possible to use the developed front end code as long as we design effective APIs.
Our preferred alternative to explore is the direct use of Git. If successful, this implementation would enable us to leverage a rich tool ecosystem for “free,” as so many tools that work with Git would not work with the geo-specific versioning system. Our initial investigations were not promising: we could not identify an effective strategy for storing geospatial data in a file system that could handle massive amounts of data while allowing fast access and clean diffs and merges. With more time invested, it may be possible to improve Git to better handle geospatial data.
A hybrid approach should also be investigated, in which a spatial database acts as the point of access for the latest data and Git supplies the history. The proposal follows a similar concept, but in practice these two components are more closely linked as the same codebase that maintains the version also keeps repositories in sync. A drawback to using Git directly would be the expense of creating additional code to sync its history with the spatial database.
The other potential direction to investigate is the use of CouchDB. It is a NoSQL peer-based replicating document store, with the most advanced spatial support of any of the NoSQL databases. It has built-in conflict detection and management, which goes a long way toward enabling versioning. Replication is also built in, so CouchDB is inherently well suited to distributed environments. It's not a full versioning system, but could potentially be extended and/or configured to become one. With its growing spatial support and ability to manage huge datasets, CouchDB could offer many advantages, among them the DataCouch project, which shares many of our goals, including dataset forking (though it applies them to generic datasets).
14. Advanced Mobile Versioning
A more advanced mobile editor would also handle conflict resolution on the device, and would be integrated into an overall workflow, rather than merely gathering data and depositing it in a more centralized store. It would be worthwhile to investigate porting a versioned repository to mobile environments, perhaps even to a lightweight device. It may be possible to use SQLite or a similar library for storage, as a mobile editor will not be as concerned about spatial indexes, huge datasets, or overly long revision histories. This application could be relatively easy to configure on Android, as it would likely use much of the same core Java code.
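To give a sense of how lightweight an on-device store could be, here is a sketch of a commit log kept in SQLite. It uses Python's standard sqlite3 module purely for illustration (an Android port would presumably use the platform's own SQLite bindings), and the schema and class name are assumptions.

```python
import json
import sqlite3


class MobileRepo:
    """A minimal on-device commit log backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS commits ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "feature_id TEXT NOT NULL, "
            "message TEXT NOT NULL, "
            "payload TEXT NOT NULL)"
        )

    def commit(self, feature_id, payload, message):
        """Store one change as its own commit and return the commit ID."""
        with self.conn:  # commits the transaction, or rolls back on error
            cursor = self.conn.execute(
                "INSERT INTO commits (feature_id, message, payload) VALUES (?, ?, ?)",
                (feature_id, message, json.dumps(payload)),
            )
        return cursor.lastrowid

    def history(self, feature_id):
        """Return (commit ID, message, payload) for a feature, oldest first."""
        rows = self.conn.execute(
            "SELECT id, message, payload FROM commits WHERE feature_id = ? ORDER BY id",
            (feature_id,),
        ).fetchall()
        return [(cid, msg, json.loads(payload)) for cid, msg, payload in rows]
```

No spatial index is created here, in keeping with the expectation that a mobile editor handles small datasets and short histories.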
Configuring full repositories for mobile devices could enable some very interesting workflows, such as straight peer-to-peer data editing. In a truly distributed environment, devices in the field would not need to sync back with a centralized server and could pass changes to one another directly and resolve conflicts on the devices. This capability could enable rapid on-the-fly collection of new geospatial intelligence, with workflows built in to the edges of the distributed network rather than passing painstakingly through centralized approval and dissemination.
15. Advanced Desktop Integration
Beyond the basic desktop integration, we envision more advanced plugins that take advantage of the capabilities of distributed versioning. They should be able to view diffs, perform rollbacks, and even eventually conduct branching and merging.
The first step in advanced desktop integration would enable versioning functionality directly on a remote server (similar to the functionality of the JavaScript client tools). A more advanced approach will enable the desktop to work against its own repository and push/pull changes to an online repository. This process would likely involve a sort of local “catalog.” It might be akin to ArcCatalog, but it would be a copy of the data rather than just a reference, and it could be kept consistently up to date. The catalog would be a somewhat stripped-down GeoServer, as there would be no need to provide renderings of maps or connections to databases. Still, it would be a fully functioning versioned repository, exposing all the API that the desktop plugin would want to talk to and able to clone from other servers and push changes out. We would run that server on the desktop, so it can truly peer with other repositories from individual desktops to large-scale hosted cloud repositories.
16. User Branching and Merging
Once merging and conflict resolution are working for simple workflows like “approval branches” and other pre-canned branching workflows, we should look into giving users control over branching and merging. This functionality would allow a single data set to have multiple streams of collaboration. For example, one group of users could work quickly on a branch that doesn't require extensive quality assurance, and their work could be pulled back into the main line when needed. Other workflows might incorporate automation or an entirely new set of governance policies. Ideally, the user interface will make it easy to create new branches for collaboration and to merge the changes from each branch into one another.
With distributed version control, branching and merging should also be possible across multiple repositories. One group might clone a data set and then set up its own workflow, but still want to pull in changes from the main repository. This capability corresponds to the “push” and “pull” functionality of Git, moving changes across different repositories. The core functionality is already largely implemented but work still needs to be done to improve the user interaction. This UI is a priority as it allows users themselves to innovate on workflows, instead of relying on developers to define what types of branches should be created and making tools that only work with those workflows. We've got a good deal of flexibility, and we need to figure out how best to expose the array of possibilities in order to enable user innovation.
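The “push” and “pull” paradigm borrowed from Git can be reduced, for illustration, to copying the commits one repository has and the other lacks. The sketch assumes each repository is a dictionary keyed by a globally unique commit ID and ignores merging entirely; it is not a description of the existing implementation.

```python
def pull(local, remote):
    """Copy commits that exist in 'remote' but not in 'local'; return the new IDs.

    Repositories are modeled as dicts mapping commit ID -> commit data (an assumption).
    """
    missing = [commit_id for commit_id in remote if commit_id not in local]
    for commit_id in missing:
        local[commit_id] = remote[commit_id]
    return missing


def push(local, remote):
    """Send local commits the remote lacks; the mirror image of pull."""
    return pull(remote, local)
```

After a pull followed by a push, both repositories hold the same set of commits, which is the starting point for any merge step layered on top.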
17. Funding Development
There has already been significant investment in these versioning innovations, both directly from OpenGeo and through several of our clients. OpenGeo built the initial Versioning WFS on its own internal research and investment, and Landgate funded a prototype JavaScript interface. Within the context of OWS-8, we completed the initial core implementation, GeoServer integration, GSS interface, and GSS datastore for syncing. IGN France funded the WFS 2.0 versioning interface on top of the DVS. LISAsoft has incorporated the DVS into uDig and created a GeoTools datastore for it, and has added push, pull, and fetch operations for their ParkInfo project. OpenGeo has a current contract that will provide additional work towards the first topics mentioned in this article, likely focusing on JavaScript tools, API, and backend improvements.
The total cost of development would run to millions of dollars, but this can be split up into smaller chunks that are easier to fund. OpenGeo always coordinates joint funding on large topics, ensuring that all funded efforts build on one another (and, needless to say, never charging two different funders for the same portion of development). We take into account the core needs of all clients and build a flexible solution that adapts to emerging client priorities while continuing to preserve the original functionality. Initially, we are looking for larger funders who can take on significant chunks of development. We are also looking to build a list of organizations that are interested in using this technology but may not be able to fund development. Once a solid base of functions has been established, we need to run the software through a number of workflows in the real world. We’d like to tap these organizations to be beta users, helping us all find out where the software is robust and where it may need to be modified.
© 2012 OpenGeo.
Redistributable under the Creative Commons Attribution-Share Alike license.
