Interest in Personal Data Stores is hitting the mainstream. An appetite for a “secure place to put data” – either a cloud-based service or a software platform that allows people to securely keep things for an indefinite time – has been whetted by a combination of two distinct factors. The first is the huge increase in the need to store things that results from the activities people now perform on-line. People want to archive memorable chat or IM conversations with friends, photos and videos they have taken and shared, articles and videos they enjoyed, and so on.
The second factor is the discomfort some people feel when using free or “freemium” cloud-based sharing and storage platforms to archive their personal information. This discomfort derives from any number of concerns. One is the perception that cloud services are untrustworthy as a place to store private information: that such services might ‘mine’ this information and sell it to third parties, inadvertently expose it by being hacked or breached, or be monitored by governmental agencies. Another is that such platforms, being susceptible to economic pressures, may suddenly change their terms of service, limit access to a person’s data, or even disappear altogether, taking that person’s data with them.
Such motivations and concerns have sparked a great many projects and platforms that advertise themselves as “personal data stores” – including, for example, The Locker Project/Singly, Mydex, Personal.com, ownCloud and others. Yet these self-labelled “personal data stores” have little in common with one another, and there is little agreement about what a Personal Data Store should support, at minimum, to provide the functionality imagined by most users.
isn’t it just a database?
To meet the use cases imagined for PDSes, a number of requirements must be met. The first requirement is that it’s, well, personal. What does this truly mean, though? The OED has two definitions for *personal*:
- belonging to or affecting a particular person rather than anyone else
- of or concerning one’s private life, relationships, and emotions rather than one’s career or public life
From definition 1 we could argue that a truly personal data store would have to be able to ‘belong to’ a person, and to function and exist without having to affect or involve anyone else. Nearly all of the self-proclaimed personal data stores mentioned above already fail this first requirement; being cloud-based, they inherently involve someone else, namely the companies, system administrators, and other individuals responsible for keeping the service running.
From definition 2, we could argue that a PDS must be suitable for storing data concerning one’s private life, as opposed to, say, the kinds of data one places on public-facing Web sites such as Facebook, Twitter, blogs, and so on. What could this possibly mean?
Based upon our research in personal information management and perceptions of Web-based cloud platforms, we think that there are at least three important dimensions: sensitive data storage, structure-agnosticity and user-centric organisation, and long-term access and retrieval.
sensitive data storage
Many kinds of personal data may be very sensitive in nature, from financial records to health records. As individuals have vastly different perceptions of what particular kinds of data should be kept ‘private’, a PDS should support various shades of access in a data-agnostic manner. For example, while many people publish their exercise logs (running or walking) on on-line services such as Fitbit and Nike+, others with mobility or cardiovascular problems may wish to keep such logs private for fear of embarrassment.
Encrypting data stores is easy – but keeping data secure isn’t. Among the issues that need to be addressed is making sure that applications granted access to the trusted data (for example, so that users can view it) don’t do naughty things with it. Such violations of trust could occur deliberately (e.g., through embedded malware/spyware) or inadvertently.
structure-agnosticity & user-centric organisation
While public data, such as our personal profiles on social networking sites, often have a fairly well established structure, private data may have arbitrary structure, and comprise lists, spreadsheets, Tweets, bookmarks, random thoughts, favourite song lyrics and so on. Our study of information scraps revealed a huge variety in the kinds of things people kept regularly, and showed that while there were a few very common types, there were also many “one-off” ‘weird things’, an observation which defies simple organisation schemes.
Related research by Bergman et al. has supported our observation that subjective attributes, properties and context are the most valuable kinds of metadata for helping people retrieve and work with their own data. While this statement may seem obvious, very little of our data on-line supports organisation according to such user-subjective principles. As just one of countless examples, Facebook does not let its users organise messages, wall posts, or even photos entirely as they please: messages and wall posts are constrained to chronological order, and photos are restricted to single-folder, owned-by-author schemes.
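To make the idea of structure-agnostic storage with user-subjective organisation a little more concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: the names `Item` and `ScrapStore` are hypothetical, nothing is persisted or encrypted, and it is not how any of the platforms mentioned above work. The point is simply that items of any shape can sit side by side, and that the only organising principle is whatever labels their owner chooses.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Item:
    """One stored thing; `content` can be any shape (text, list, dict, ...)."""
    content: Any
    # Free-form labels chosen by the owner, not a schema fixed by the platform.
    tags: set[str] = field(default_factory=set)


class ScrapStore:
    """Hypothetical in-memory store that imposes no schema on its items."""

    def __init__(self) -> None:
        self._items: list[Item] = []

    def add(self, content: Any, *tags: str) -> Item:
        item = Item(content, set(tags))
        self._items.append(item)
        return item

    def find(self, *tags: str) -> list[Item]:
        """Return the items carrying every requested tag, in insertion order."""
        wanted = set(tags)
        return [item for item in self._items if wanted <= item.tags]


store = ScrapStore()
store.add("Call the plumber about the leak", "urgent", "house")
store.add({"song": "untitled demo", "note": "chorus idea"}, "lyrics", "cheers me up")
store.add([["milk", 2], ["eggs", 12]], "shopping", "house")

# A free-text note and a shopping list come back together, because the
# owner decided that both belong to "house".
house_items = store.find("house")
```

Here a note, a dictionary and a nested list share one store, and retrieval depends only on the owner's own labels rather than on the type or date of each item.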
long-term access and retrieval
The most critical aspect of personal data stores that has not been fully explored is the long-term (indeed, lifetime) archiving and survival of data. This is particularly challenging given the rate at which software and hardware platforms evolve from one year to the next, and the rapid boom-bust cycle that tech companies frequently experience – meaning that platforms are born and die in fractions of a decade. The result is two challenges, the two ‘R’s: resilience and readability.
Data resilience means ensuring that your valuable data doesn’t disappear. There are nearly countless ways in which data might go missing against your will: hardware failure, software failure, user error, cloud hosting providers going bankrupt, hosting services being seized by governments, hostile situations in a person’s home country, home fires, lost flash drives, and so on.
There are many common techniques people use to improve resilience and reduce the likelihood of data loss. One simple technique to guard against failure of a storage container is replication, i.e., putting a copy of one’s data in multiple places. A challenge in leveraging such a technique, however, is to provide a mechanism by which people can allocate such multiple sites effectively, and a scheme by which multiple such PDSes can be kept in sync with one another.
The second challenge concerns not losing the data itself, but losing the ability to read it. With data formats and platforms changing constantly, many people have already lost access to important files and documents preserved on tapes and diskettes in file formats created by software platforms that ran on hardware from decades past. In many such situations, the cost of recovering this data (requiring specialists who can reverse engineer the file format(s) or who still have working instances of such machines) is too high to warrant actually recovering it.
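As a rough illustration of replication, the following Python sketch copies a file to several “sites” and later checks which copies are missing or no longer match. It rests on several assumptions of ours: the sites are plain directories (temporary ones here, standing in for a laptop, a flash drive and a friend’s machine), the function names are hypothetical, and it does not address the harder problem of keeping several changing PDSes in sync.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's bytes so copies can be compared by content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(source: Path, sites: list[Path]) -> list[Path]:
    """Copy `source` into every site directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for site in sites:
        site.mkdir(parents=True, exist_ok=True)
        copy = site / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            raise IOError(f"copy at {copy} does not match the original")
        copies.append(copy)
    return copies


def damaged_copies(source: Path, copies: list[Path]) -> list[Path]:
    """Return the replicas that are missing or no longer match the source."""
    expected = sha256_of(source)
    return [c for c in copies if not c.exists() or sha256_of(c) != expected]


# Temporary directories stand in for real storage sites (an assumption).
root = Path(tempfile.mkdtemp())
diary = root / "diary.txt"
diary.write_text("Dear diary, today I backed everything up.", encoding="utf-8")

copies = replicate(diary, [root / "laptop", root / "usb_stick", root / "friend"])
copies[1].unlink()  # simulate a lost flash drive
lost = damaged_copies(diary, copies)  # only the flash-drive copy is reported
```

Comparing checksums rather than file names or dates means that a silently corrupted copy is detected as well as a missing one. A real PDS would also need to decide which copy is authoritative when several have changed, which this sketch does not attempt.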
Professionals who deal with this challenge as their full-time job include archivists and librarians. The DSpace Digital Archive at MIT, for example, archives every single Master’s and Ph.D. thesis created at MIT (including those from 100 years ago) and is pioneering efforts in this space. To this end, their technique has been to use simple, open data formats that, by being thoroughly documented (with documentation that itself will be similarly archived), will remain readable even if computers, hardware and operating systems change significantly. The idea is that a programmer in the future could read the description of the data format and write a converter to decode the data.
In this initial blog post, we have offered several suggestions about requirements that should be integral to the design of Personal Data Stores if they are to fulfil the role conveyed by much of the popular media. These three goals are core to our own efforts at developing a distributed, free and open source personal data platform, which we will describe in a later post.
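The same principle can be sketched at a much smaller scale. The Python example below is our own illustration, not DSpace’s actual format or tooling: it writes records as plain UTF-8 JSON and stores a human-readable description of the format inside the archive itself, so that the documentation travels with the data. The field names and the `export_archive` and `import_archive` helpers are hypothetical.

```python
import json

# Plain-language description kept with the data, so that a future reader
# does not need the original software to interpret the archive.
FORMAT_DESCRIPTION = (
    "UTF-8 encoded JSON object. 'format_version' is an integer. "
    "'records' is a list; each record has 'date' (ISO 8601, YYYY-MM-DD), "
    "'kind' (free text) and 'body' (free text)."
)


def export_archive(records: list[dict]) -> str:
    """Serialise records together with the description of their format."""
    archive = {
        "format_version": 1,
        "format_description": FORMAT_DESCRIPTION,
        "records": records,
    }
    return json.dumps(archive, indent=2, ensure_ascii=False, sort_keys=True)


def import_archive(text: str) -> list[dict]:
    """Decode an archive, refusing versions this reader does not know."""
    archive = json.loads(text)
    if archive["format_version"] != 1:
        raise ValueError(f"unknown format version: {archive['format_version']}")
    return archive["records"]


records = [
    {"date": "2012-03-01", "kind": "note", "body": "Renew passport"},
    {"date": "2012-03-04", "kind": "bookmark", "body": "Article on data stores"},
]
text = export_archive(records)
restored = import_archive(text)
```

Because the output is indented text with a version number and an embedded description, someone opening the file in a text editor decades later would have a reasonable chance of writing the converter described above.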