Introduction

While developing the Berkeley Ecoinformatics Engine (Ecoengine), we followed a number of standards and principles. To keep the Ecoengine lean and fast, not all of them are fully implemented. This introduction covers some of our design decisions. We welcome user feedback so that we can review these decisions and make the Ecoengine useful for a wide array of users and applications inside and outside of academia.

Interoperability and Documentation

The development of an API serving heterogeneous data sets requires many decisions, e.g. how to transcribe field names, which fields to include, how to deal with obviously erroneous or incomplete records, whether to exclude rarely used fields, and whether to normalize or de-normalize fields.

Our data sets are well curated, and their institutions and curators are well documented. However, some of the older data are not well presented by external Internet sources. In cases where we are the primary data provider, we develop and serve comprehensive metadata based on our research. If possible, we also provide raw data to make our curatorial decisions transparent. An example is the Wieslander Vegetation Type Mapping Survey, which is served exclusively through the Ecoengine.

The complexity of the data and technical limits required some decisions by our development team. These decisions may be reviewed by the curators later. Nevertheless, the persistence of data and URLs has been a major principle.

Since the main concern of the Ecoengine is data discovery and rapid analysis, we focused on interoperability instead of negotiating new standards on which every data provider and potential user could agree.

We started from existing, well-established standards (most prominently DarwinCore) and made sure that:

  1. Data presentation is simple.
  2. Returned data can be easily remapped by consumers (JSON is the preferred return format).
  3. Records link back to the original (upstream) representation of the data on the Internet, when available.
  4. Design decisions are documented.
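
The remapping mentioned in point 2 can be sketched as follows. This is a minimal illustration only: the sample record, its field names, and the target names in FIELD_MAP are made up for this example and are not taken from the Ecoengine's actual responses.

```python
import json

# Hypothetical mapping from returned field names to the names a consumer prefers.
FIELD_MAP = {
    "scientific_name": "scientificName",
    "begin_date": "eventDate",
}


def remap_record(record, field_map):
    """Return a copy of record with keys renamed according to field_map.

    Keys that do not appear in field_map are kept unchanged.
    """
    return {field_map.get(key, key): value for key, value in record.items()}


# A made-up JSON record standing in for one returned item.
raw = (
    '{"scientific_name": "Quercus agrifolia", '
    '"begin_date": "1931-05-04", '
    '"url": "https://example.org/record/1"}'
)
record = json.loads(raw)
remapped = remap_record(record, FIELD_MAP)
```

Because JSON decodes into plain dictionaries, a consumer can apply a mapping like this without any Ecoengine-specific tooling.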

Engine vs. Archive

The Ecoengine is not an authoritative data archive. It is rather a directory and gateway to widely distributed data. It stores its own copy of data in order to allow for quicker access, processing, searching, and aggregating. The term “Engine” denotes that the Ecoengine is first and foremost a tool for scientific data analysis.

Warning

The Berkeley Ecoinformatics Engine is not an authoritative data archive. For terms of usage and accuracy of the data, please revisit the original sources as documented by the /api/sources/ resource or links provided on the record level.

Time and Geographical Place as Unifying Concepts

The organizing concept of data in the Ecoengine is that events can be identified by their occurrence in space and time. Both of these dimensions are consistently present in all data (even though not necessarily in the form of exact time stamps or geographical coordinates). This is an important precondition to describe global and local change in biological systems.

We hope to provide this information as precisely as possible and document the precision itself (e.g., geographic precision depends on the quality of the source locality data).
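
One way to picture an event that carries both dimensions together with their precision is sketched below. The class, its field names, and the sample values are assumptions made for illustration; they do not describe the Ecoengine's actual data model. Imprecise time is modeled as a date range, and imprecise location as a point with an uncertainty radius.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical event located in time (a date range) and space (a point)."""

    begin_date: date
    end_date: date
    latitude: float
    longitude: float
    coordinate_uncertainty_m: float  # documented precision of the locality


def occurred_within(event, bbox, start, end):
    """Return True if the event's point lies inside bbox and its date range
    overlaps the closed interval [start, end].

    bbox is (min_lon, min_lat, max_lon, max_lat).
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    in_space = (
        min_lon <= event.longitude <= max_lon
        and min_lat <= event.latitude <= max_lat
    )
    in_time = event.begin_date <= end and event.end_date >= start
    return in_space and in_time


# A made-up event whose date is only known to the year.
event = Event(date(1931, 1, 1), date(1931, 12, 31), 37.87, -122.26, 5000.0)
bbox = (-123.0, 37.0, -122.0, 38.0)
in_june_1931 = occurred_within(event, bbox, date(1931, 6, 1), date(1931, 6, 30))
in_1932 = occurred_within(event, bbox, date(1932, 1, 1), date(1932, 12, 31))
```

Note that this sketch ignores the uncertainty radius when testing the bounding box; a stricter query could buffer the box by that radius.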

Data Types

The Ecoengine is organized around two major data types: 1) events and 2) multidimensional rasters.

Event Data

Event data are generated in so-called collection events. They consist of single “events” identified by time and space. A typical example of an event in the Ecoengine is a specimen that was collected in the field and is currently represented by a physical object in a museum collection. Other types of events include photographs, species lists, surveys of plots, and more.

An event, e.g. a photo taken in a certain place, can create secondary events, e.g. species observations within the resulting photo. In the Ecoengine, such events have both a photo record and an observation record, which are internally linked to each other (see resource documentation for details).
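
The relationship between a primary and a secondary event can be sketched as two records that reference each other. The class names, fields, and identifiers below are hypothetical and only illustrate the idea; the resource documentation describes how the Ecoengine actually links records.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical secondary event: a species seen in a photo."""

    scientific_name: str
    photo_id: str  # link back to the photo that produced this observation


@dataclass
class Photo:
    """Hypothetical primary event: a photo taken at some place and time."""

    photo_id: str
    observations: list = field(default_factory=list)

    def add_observation(self, scientific_name):
        """Create an observation linked to this photo and register it here."""
        observation = Observation(scientific_name, self.photo_id)
        self.observations.append(observation)
        return observation


photo = Photo("photo-1")
observation = photo.add_observation("Sequoia sempervirens")
```

After this, the photo lists its observation and the observation points back to the photo, so either record can be reached from the other.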

Raster Data

Raster data represent continuous or discrete phenomena across a regularized grid. Each pixel or grid cell is represented by a scalar value. Most of the multidimensional array-oriented data within the Ecoengine consist of the results of various climate model runs.
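
As a rough illustration of the grid model, the sketch below stores a small raster as nested lists indexed by time step, row, and column, and looks up the scalar value of the cell containing a coordinate. The grid origin, cell size, and values are invented for this example and do not correspond to any Ecoengine raster.

```python
def cell_index(lon, lat, origin_lon, origin_lat, cell_size, n_rows, n_cols):
    """Map a coordinate to (row, col) in a regular grid.

    The origin is the upper-left corner of the grid, so rows grow southward
    and columns grow eastward.
    """
    col = int((lon - origin_lon) // cell_size)
    row = int((origin_lat - lat) // cell_size)
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("coordinate lies outside the grid")
    return row, col


def value_at(raster, t, lon, lat, origin_lon, origin_lat, cell_size):
    """Return the scalar stored for time step t in the cell containing (lon, lat)."""
    grid = raster[t]
    row, col = cell_index(
        lon, lat, origin_lon, origin_lat, cell_size, len(grid), len(grid[0])
    )
    return grid[row][col]


# Made-up raster: 2 time steps x 2 rows x 3 columns, 0.5 degree cells,
# upper-left corner at longitude -123.0, latitude 38.0.
raster = [
    [[10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0]],
    [[10.5, 11.5, 12.5],
     [13.5, 14.5, 15.5]],
]
first = value_at(raster, 0, -122.3, 37.4, -123.0, 38.0, 0.5)
second = value_at(raster, 1, -122.3, 37.4, -123.0, 38.0, 0.5)
```

The extra leading index is what makes the data multidimensional: the same cell can be read across time steps, or across model runs if another axis is added.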

GBIF DarwinCore

Since observations of organisms (specimens, plot data, observations of living organisms) are currently the biggest resources within the Ecoinformatics Engine, the Ecoengine uses a shortened version of GBIF’s DarwinCore fields.

RESTful API

The Ecoengine is an implementation of a RESTful API. A RESTful API can be described as a machine-readable web-based representation of data resources utilizing the functionality of the HTTP protocol. Compared to other forms of web services, simplicity, ease of use, and interoperability are among the main advantages.

API stands for Application Programming Interface. Its main function is to provide a well-documented, tested, and persistent interface through which applications can use data resources and built-in functionality for data search, aggregation, and extraction.

REST is short for Representational State Transfer, a software architecture style proposed by Roy Fielding in 2000. There is a lot of discussion regarding which specific implementations are RESTful and which are not. Many of these questions are matters of taste or design decisions, best (or not so good) practices that were not predetermined by Fielding's original dissertation.

Main principles are:

  • usage of the full flexibility of the HTTP protocol, including headers and the different verbs (GET, POST, DELETE, PATCH, PUT),
  • representation of state within the request itself, including request headers, URL/URI, and posted data,
  • persistence: a request should return the same result at any time.

The usage of the HTTP protocol and asynchronous AJAX requests issued from client-side JavaScript applications is nothing new to web development. However, a RESTful API provides a framework that makes data and functionality that are usually part of back ends available to a wider developer community.

Current Status of the Ecoengine

Currently, the Ecoinformatics Engine is a rather simple API and only accepts GET requests that do not require authentication.
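
An unauthenticated GET request of this kind can be sketched with the Python standard library alone. The host name below is a placeholder rather than the real service address, and the `page_size` parameter is a hypothetical query parameter used only to show how a query string is attached; only the `/api/sources/` path is taken from this document.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Placeholder host for illustration; substitute the actual Ecoengine address.
BASE_URL = "https://ecoengine.example.org"


def build_get_request(path, params=None):
    """Build a GET request that asks for JSON and carries no credentials."""
    url = urljoin(BASE_URL, path)
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON body (performs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request needs no network; pass it to fetch_json() to send it.
request = build_get_request("/api/sources/", {"page_size": 10})
```

Everything the server needs is contained in the request itself (verb, URL, and headers), which is what allows the same request to be repeated later and compared.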

The Ecoengine supports Cross-Origin Resource Sharing (CORS), meaning that browsers do not block JavaScript requests to it from other origins under the same-origin policy. For this reason, Ecoengine data can easily be included in any JavaScript application without an additional back end (although one might become necessary later to handle authorization).

We plan to maintain the current level of open access to support a wide range of users. However, authentication will be implemented in the future to support POST requests as well as less throttled access for selected applications.