Data loss in recent migration

paulwalsh - 12/01/2016 in Community Updates, News

We recently undertook a large-scale migration of DataHub. This migration included upgrading CKAN to the latest stable release, running DataHub on a completely new cloud infrastructure, and moving to redundant S3 buckets for storing dataset resources.

The migration occurred in November 2015, after lots of hard work by our infrastructure and development teams, and we’ve since seen a more performant and stable DataHub, which will serve as a basis for new developments in 2016.

What happened

On the 11th of January 2016, we were alerted in this post on the Open Knowledge Discuss Forum to the potential loss of data. On immediate investigation, we found that 38 datasets, with a total of 78 resources, had been irretrievably lost.

Details of lost data can be found here.

These resources were, unlike the rest of the data on DataHub, stored on the local file system of the server, and not on a cloud-based storage backend. We have no way to retrieve these resources.

We deeply regret any inconvenience caused by this data loss. While DataHub is a free service without any specific guarantees for data persistence, we have let ourselves and the community of users down by this loss of data.

Actions

We have taken steps to prevent this happening in the future. In fact, the move to the new infrastructure is the solution, as resources are now stored on fully redundant S3 storage.

Additionally, DataHub is now more robust and secure in general, with a reliable backup system in place for the database, and stateless application servers running in Docker in a new cluster. We appreciate the trust the community puts in us to host their data, and we’ll be doing more in 2016 to make DataHub a stable, reliable and free solution for storing and accessing open data.

We are happy to assist owners of these datasets in republishing the data if they desire. Please get in touch.

Organization Migration Complete

Ross Jones - 11/10/2013 in Community Updates, News

IMPORTANT: if you need help please see the forum at help.datahub.io.

In particular: you now need to be part of an organization to create datasets, and registering an organization requires admin assistance (via the forum).

The planned organization migration has happened, and we’ve managed to get rid of a lot of spam, but not all.

As a result of the migration, all datasets must now belong to an organization. This means that datasets that were not previously part of a group have been moved into a single org (Global – http://datahub.io/organization/global).

If you find that your datasets have been moved into Global, chances are you won’t currently be able to move them to another Organization without help. I’m planning to make it possible for users who created a dataset to move it, but you will still need help in setting up an organization. It may be that you’re happy to leave it in the Global org, in which case you’re probably just waiting for me to enable you as an editor on the dataset. This will hopefully happen before the end of the coming weekend; if you need it sooner, please get in touch and let us know.

We’re expecting that it is now a LOT harder for spammers to post to datahub.io, and only slightly harder for users. There is still some cleaning up to do, and old data to purge, but we’re pretty close now to a place where we can add new features (add some more ideas in the issue tracker) and improve the performance.

Whilst waiting for my migration script to finish last night (it took about 5 hours to clean spam and move the datasets), I also installed Varnish, so if you see any caching irregularities, please just shout.

Organizations Upgrade

Rufus Pollock - 08/10/2013 in Community Updates, News

A quick note about an upcoming upgrade to help us address spam and improve the manageability of datahub.io.

Anybody can currently post a dataset or create a group on datahub.io, and we think this is a good thing. However, it means that spammers can also post datasets, and they have. Whilst we certainly don’t want to close datahub.io to contributions, we do need to make changes to dramatically reduce the amount of spam being posted, and we think that we may have a way to achieve that.

CKAN 2.0 introduced the idea of “Organizations” that own/publish datasets and which can have members who can be administrators (who can add users and manage the organization) and editors (who can only add datasets to the organization). This brings many features, not the least of which are that:

  • It provides a much richer permissions and authorization structure (based around the organization) that gives users greater control over who can, or cannot, edit and add datasets
  • It provides a clear organization-oriented structure for presenting and finding datasets
  • It will help us address the spam problem by providing more control over who adds datasets, as it will be a requirement that datasets are added to organizations.
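
As a rough illustration of the administrator/editor split described above, here is a minimal Python sketch. It is not CKAN's actual authorization code: the `Role` enum, the `PERMISSIONS` table, the action names and the `is_allowed` helper are all hypothetical, and it assumes that administrators may also add datasets, which the description above does not state explicitly.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical member roles within an organization."""
    ADMIN = "admin"
    EDITOR = "editor"


# Hypothetical action names. Editors can only add datasets; administrators
# can add users and manage the organization (and, by assumption, add datasets).
PERMISSIONS = {
    Role.ADMIN: {"add_user", "manage_organization", "add_dataset"},
    Role.EDITOR: {"add_dataset"},
}


def is_allowed(members, username, action):
    """Return True if `username` holds a role in `members` that permits `action`.

    `members` maps usernames to a Role; users who are not members of the
    organization are never allowed to act on it.
    """
    role = members.get(username)
    return role is not None and action in PERMISSIONS[role]


members = {"alice": Role.ADMIN, "bob": Role.EDITOR}
print(is_allowed(members, "bob", "add_dataset"))      # editor adding a dataset
print(is_allowed(members, "bob", "add_user"))         # editor trying to add a user
print(is_allowed(members, "mallory", "add_dataset"))  # non-member
```

Because a non-member has no role at all, a freshly registered spam account cannot add datasets to any organization, which is the point of the change.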

We plan to enable organizations in the next few days. This will have several major effects:

  • You will only be able to create a dataset if you belong to an organization (and creating an organization at present requires approval from an Administrator)
  • Groups will be automatically migrated to organizations and the user account that created them will be made the administrator. If you find you don’t have as much control over your organization as you thought, please let us know!
  • If you had datasets on datahub.io that were not part of a group, they will be added to a ‘Global’ organization and we will help you move them to a new organization should you wish to move them.
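
The migration rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual migration script: the `migrate` function and the dictionary shapes are hypothetical, each dataset is assumed to belong to at most one group, and no group is assumed to already be called `global`.

```python
GLOBAL_ORG = "global"  # hypothetical identifier for the 'Global' organization


def migrate(groups, datasets):
    """Turn groups into organizations and assign every dataset to one.

    groups:   list of dicts like {"name": ..., "creator": ...}
    datasets: list of dicts like {"name": ..., "group": ... or None}
    Returns a dict mapping organization name -> {"admins", "datasets"}.
    """
    organizations = {GLOBAL_ORG: {"admins": set(), "datasets": []}}

    # Each group becomes an organization; its creator becomes the administrator.
    for group in groups:
        organizations[group["name"]] = {
            "admins": {group["creator"]},
            "datasets": [],
        }

    # Datasets keep their group's organization; ungrouped ones go to Global.
    for dataset in datasets:
        owner = dataset.get("group")
        if owner not in organizations:
            owner = GLOBAL_ORG
        organizations[owner]["datasets"].append(dataset["name"])

    return organizations


result = migrate(
    groups=[{"name": "geo", "creator": "alice"}],
    datasets=[
        {"name": "rivers", "group": "geo"},
        {"name": "orphan", "group": None},
    ],
)
print(result)
```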

The migration is likely to take a couple of hours, and during this time datahub.io is likely to be rather unresponsive. As a result, it is likely that we will take the site down for a short period of time, but I’ll make sure that we notify the list should it look like it is going to take longer.

Thanks for your patience whilst we sort out the spammers; hopefully we’ll be seeing a lot less of them in the near future.

Tackling Spam on the DataHub

Rufus Pollock - 30/09/2013 in Community Updates, Get Involved

We’ve seen increasing levels of spam recently on the DataHub even with the presence of captchas and other anti-spam devices.

Our immediate priority is to eliminate existing spam and get on top of new spam using human efforts before putting in place semi-automatic processes for stopping it.

We need your help!

We’re going to need help to tackle this – in particular, we’ll need some folks to get extra-special spam-fighting powers and act as spam-fighters. If you could spare a few minutes a day or even a week, please let us know! You can either email me or sign up on this Google Doc:

Tackling Spam on the DataHub Doc »

Goals

  1. Removal – get rid of all existing spam
    • Despam user list
    • Despam dataset list
    • Despam “related”
  2. Stop it coming back in future
    • Establish a group of monitors working in rotation who kill spam as soon as it appears
    • Put in place new systems and processes to stop it in future

Immediate Plan

Our immediate plans are to:

  1. Move to Organizations
  2. Remove existing spam

More details on both to follow soon. If you are interested in helping please add yourself to the list of volunteers in this doc (or tell us in the comments!).

If you have any other thoughts or ideas please let us know!

Trello board for coordinating Community Efforts

Rufus Pollock - 23/09/2013 in Community Updates, News

We’ve been busy since our last update getting things moving. First off, we’ve created a public Trello board to enable better coordination of community DataHub efforts going forward:

https://trello.com/b/SxDPOvJm/datahub

Anyone can add to this board and we’ve created a “Read Me First” card to get you started:

https://trello.com/c/rfwnwU5v/13-read-this-first

If there are particular features that you believe are more important than others, adding a comment to the card on Trello will help us in prioritising the order of work – of course, code contributions are always welcome too!

For those who wish to add new feature requests, please let us know your Trello username and we can get you added to the board.

DataHub Important Community Update

Ross Jones - 11/09/2013 in News

TL;DR: changes are afoot for the DataHub. Following the recent (technical) upgrade, it’s time for a community upgrade, including new dedicated DataHub mailing lists and a call for contributors to a new DataHub team.

We’re writing to give an important update on the DataHub – the community data hub powered by CKAN we’ve been running since 2006!

Over the last year or so DataHub has not kept up with the speed of change that has been happening within CKAN itself. As a result it hasn’t received as much attention as it should, and we feel it is high time to bring DataHub up to date and make it the awesome community owned and run data hub it can and should be!

DataHub is and will remain community run, and community owned, and will have a solid technical home within the Open Knowledge Foundation Labs. However, if the DataHub is to meet the needs and expectations of those using it, we need to create a team of administrators, curators, developers and more – so we need your help!

There are lots of ways to get involved. We will need help with advocacy, triaging ideas and issues, and writing code. To help coordinate this we have created two new mailing lists dedicated to the DataHub: datahub-discuss for those who want to get involved in helping take the DataHub forward, and datahub-announce for announcements. [1]

If you’re interested in helping take DataHub forward, please join the datahub-discuss list and introduce yourself!

Ross Jones and Rufus Pollock, DataHub Coordinators


Notes

[1] With the growth of the DataHub and CKAN it is now time for the DataHub to have its own dedicated space and community – there are now hundreds of CKAN instances whereas once upon a time there was only one: ckan.net. (We renamed ckan.net to datahub.io in 2011 to avoid confusion between CKAN (the software) and the website!)

Tutorial: the DataStore and Data API

Rufus Pollock - 02/03/2012 in Tutorial

This tutorial walks through the DataStore and Data API features in a CKAN-powered site like the DataHub.
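
For a flavour of what a Data API request looks like, here is a small Python sketch that builds a query URL for the `datastore_search` action found in current CKAN releases. The tutorial itself may describe an earlier interface, and the site URL and resource id below are placeholders rather than real values; the helper function name is hypothetical.

```python
from urllib.parse import urlencode


def datastore_search_url(site_url, resource_id, limit=5, query=None):
    """Build a URL for CKAN's datastore_search action.

    site_url:    base URL of the CKAN site (placeholder in the example below)
    resource_id: id of a resource whose data is held in the DataStore
    limit:       maximum number of rows to return
    query:       optional full-text search string
    """
    params = {"resource_id": resource_id, "limit": limit}
    if query is not None:
        params["q"] = query
    base = site_url.rstrip("/") + "/api/3/action/datastore_search"
    return base + "?" + urlencode(params)


# Placeholder site and resource id, for illustration only.
url = datastore_search_url("https://ckan.example.org/", "abc-123", limit=2)
print(url)
```

Fetching that URL with any HTTP client should return a JSON document; on a real site the rows are expected under `result.records`.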

Tutorial: Publish Data with the Datahub

Rufus Pollock - 02/03/2012 in Tutorial

The following simple slideshow walks through publishing data on a CKAN-powered site like the DataHub.