02/13/24

Mapping the Decentralized Storage Ecosystem

By Marios Isaakidis, Natalie Cadranel & Chrystalleni Loizidou

We are on the cusp of a major overhaul of the internet, in which decentralization, security, and digital rights can be built into the core fabric of the next web. The internet landscape is transforming in response to the dominance of centralized Web 1.0 and 2.0 platforms like Facebook, Twitter, and Google. That dominance has brought many harms, particularly for vulnerable groups: rampant mis/disinformation, a global rise in fascism, the erosion of privacy and civil liberties, and increased censorship, surveillance, internet blocking, targeting, doxxing, and more. The next web draws on the decentralized intent of past models, aiming to create a new framework that transcends and corrects for past negligence and opportunism. From long-standing tools like BitTorrent and Tor to overhyped blockchain-based cryptocurrencies and NFTs, decentralized technologies are at the forefront of the evolving web, offering users liberation from centralized control. These tools hold the potential to combat censorship and protect sensitive information, which is crucial in contexts where doing so can save lives. However, while decentralization offers advantages such as archival resilience, it is not a universal solution, and its limitations must be carefully considered.

At OpenArchive, we center human rights and our communities throughout our research and development processes to best protect those who are often risking their lives to capture, preserve, authenticate, and share evidence of injustice. We know that, with the right tools, they can advance justice and accountability by exposing abuses spanning police brutality, protests, government corruption, war crimes, environmental harm, and many other human rights offenses.

While decentralized storage and the blockchain are highly effective technologies that can improve the archival workflow by offering media redundancy, improved accessibility, authentication, and provenance for that media, we have reservations about their drawbacks for our Decentralized Archivist Communities. At best, we hope that these technologies will be able to provide secure, distributed, verifiable, long-term archival storage for these communities' crucial evidentiary media. Toward this end, we will explore their benefits, risks, and drawbacks in forthcoming publications. Here, we focus on mapping out the functionalities of these technologies, their potential, their limitations, and some key projects in the field.

Decentralized Storage 101

At the core of the archival process lies the critical component of storage. In the spirit of the archival ethos “Lots of Copies Keep Stuff Safe” (LOCKSS), decentralized storage can solve the redundancy challenge archives face. Decentralized storage technologies involve distributing data across multiple computers and regions, each independently controlled by a different entity. In this way, they can offer a more robust and collaborative approach to data hosting, long-term preservation, and access.

The concept of distributing data across multiple computers is not, in itself, particularly novel. In fact, distributing data is a broadly applied technology that enables large providers to coordinate the immense storage and computing resources needed to operate many popular online services. Consider, for instance, the massive volumes of data that large social media platforms handle—petabytes of information distributed across datacenters spanning multiple geographic locations.

Distributed storage is a useful technology for large digital archives. As of January 2024, the Internet Archive stores over 99 petabytes of data in various remote facilities that it administers itself. In particular, it hosts multiple copies of its files simultaneously, each stored in a different location, to ensure the availability of the service and the resilience of the stored information: a technical failure in one data center will not bring down or destroy the archive. To ensure the longevity of the archive, it is joining forces with the Filecoin decentralized ecosystem as well as with international entities that replicate a copy in their own data centers. As Brewster Kahle, Founder of the Internet Archive, explains, “decentralized storage is absolutely critical, and having no points of absolute control is the only way to make things last a long time.”

What truly defines a storage system as decentralized, beyond data distribution, is the underlying principle that the computers hosting the data are controlled by different entities and that no single entity exercises full control. This kind of “true” decentralization mitigates risk and liability: it safeguards the archive against interference, such as censorship or media manipulation, and against technical failures, an important consideration when dealing with sensitive evidentiary media. Distributing responsibilities across multiple entities can improve redundancy and split up the tasks of managing, curating, and hosting the archive. The cryptographic foundations of any well-designed decentralized system have the added benefit of providing privacy, thereby reducing the potential for individual entities to be held solely accountable.

To navigate the diverse ecosystem of decentralized storage systems, we classify them into three categories based on who is storing the files: user-hosted systems, in which all users actively participate in file storage; delegated systems, where the responsibility of providing storage resources is delegated to a select group of computers; and blockchain-based systems, wherein storage providers are compensated with cryptocurrency tokens. All three categories are decentralized, as they are based on distributed systems with open specifications and, in most cases, they allow anyone to join the network. (Classifying systems by who hosts the files is important because it determines the technical requirements placed on users as well as the security and availability of the system.)

User-Hosted Decentralized Storage Networks

The use of decentralized storage for archival purposes dates back to 1996, with the inception of the Eternity Service, followed by Publius, the Free Haven project, Tangler, and Freenet, among others. These projects shared a common approach: all users participate in replicating data, creating a resilient, redundant, distributed storage system that makes it difficult to take down content and ensures long-term accessibility. As long as even one of the computers hosting a file remains accessible, that file remains available within the system. Consequently, it becomes exceedingly challenging for an oppressive agent to take down all computers hosting a specific file, particularly when these computers are located outside that agent’s control. Users can retrieve files from any of the computers with a copy, without having to rely on a central archive. The archive remains live even when a subset of the hosting computers goes offline.
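To make the "one surviving copy is enough" argument concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every host is unreachable with the same probability and that hosts fail independently of one another; real networks rarely satisfy either assumption, so treat the numbers as an intuition aid rather than a prediction. The function name `availability` is our own.

```python
def availability(p_offline: float, replicas: int) -> float:
    """Chance that at least one of `replicas` hosts is reachable.

    Assumption (illustrative only): each host is offline with the same
    probability `p_offline`, independently of the others.
    """
    if not 0.0 <= p_offline <= 1.0:
        raise ValueError("p_offline must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The file is lost only if every single replica is offline at once.
    return 1.0 - p_offline ** replicas


if __name__ == "__main__":
    # Even with unreliable hosts (offline half the time), a handful of
    # independent copies makes the file very likely to be reachable.
    for n in (1, 3, 5, 10):
        print(n, round(availability(0.5, n), 4))
```

Under these assumptions, the benefit of each extra copy shrinks quickly, which is why the independence of the hosts matters more than the raw number of replicas.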

This first category of decentralized storage systems highlights two building blocks that are helpful for human rights archiving: cryptography and peer-to-peer networking. Let us look at these two features in more detail, focusing first on cryptography.

Cryptography ensures data integrity and user privacy. Cryptographic hashes and digital signatures help verify that the data has not been modified or corrupted, even when stored or transferred over untrusted computers. Anonymity and plausible deniability, achieved by cryptographic techniques and by mixing together the activity of multiple users, allow users to host and access the archive without fear, as their requests cannot be linked to their identity, and they cannot be held accountable for content they store and transfer.
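As a minimal sketch of the integrity half of this idea, the snippet below uses a SHA-256 hash from Python's standard library as a content fingerprint. The helper names `content_id` and `verify` are hypothetical and not taken from any of the systems discussed; digital signatures, which additionally bind content to a publisher's key, are not shown here.

```python
import hashlib
import hmac


def content_id(data: bytes) -> str:
    """Fingerprint of the content: any change to `data` changes the result."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_id: str) -> bool:
    """Check bytes fetched from an untrusted computer against a known fingerprint."""
    return hmac.compare_digest(content_id(data), expected_id)


if __name__ == "__main__":
    original = b"video-frame-bytes"
    fingerprint = content_id(original)          # published alongside the file
    print(verify(original, fingerprint))        # unmodified copy
    print(verify(original + b"!", fingerprint)) # tampered or corrupted copy
```

Because the fingerprint depends only on the content, a user can fetch the file from any host and still detect modification without trusting that host.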

Here is an illustration of how these networks work:

Diagram 1: In user-hosted decentralized storage networks all users participate in hosting and transferring encrypted parts of the files.

In order to preserve the privacy of requests, rather than connecting directly with the target computer, users route their traffic through intermediary computers. In principle, given that everyone is relaying requests on behalf of others, it would be difficult to prove that a specific user is providing or accessing a specific piece of data, rather than merely acting as an intermediary relay. Each computer in the network creates a constant end-to-end encrypted traffic pattern by also relaying the requests of others, making it difficult for adversaries who observe network activity to figure out when a user is active or idle. Therefore, the traffic from any given computer is the same whether the user is interacting with files or just sitting idle. This end-to-end encryption allows for “plausible deniability” whereby the operator of any computer in the network cannot know what data they are relaying or storing, and can thus deny any responsibility for it.

Here's how plausible deniability works in detail:

  1. During the publication process, a file is divided into fixed-sized encrypted fragments before being uploaded;

  2. These fragments are then duplicated across multiple computers located in different parts of the network;

    a. Crucially, the computers responsible for storing and transmitting these encrypted fragments do not possess the decryption key and can, therefore, plausibly deny knowledge of their content.

  3. Still, users who know the decryption key can retrieve these fragments from the decentralized storage network and reassemble them into the original file.
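The steps above can be sketched in a few lines of Python. This is a toy model under loud assumptions: hosts are plain dictionaries, replica placement is simple round-robin, and the "cipher" is a SHA-256-based XOR keystream chosen only so the example runs with the standard library. It is not secure and is not how any of the named systems encrypt data; a real deployment would use a vetted authenticated cipher. All names (`publish`, `retrieve`, `FRAGMENT_SIZE`) are hypothetical.

```python
import hashlib
import secrets

FRAGMENT_SIZE = 16  # bytes; deliberately tiny for the demo


def _keystream(key: bytes, index: int, length: int) -> bytes:
    # Toy keystream (NOT secure): stand-in for a real cipher.
    out, counter = b"", 0
    while len(out) < length:
        block = key + index.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def publish(data: bytes, key: bytes, hosts: list, copies: int = 2) -> int:
    """Steps 1-2: split into fixed-size encrypted fragments, then replicate."""
    framed = len(data).to_bytes(8, "big") + data          # remember true length
    framed += b"\x00" * (-len(framed) % FRAGMENT_SIZE)    # pad the last fragment
    count = len(framed) // FRAGMENT_SIZE
    for i in range(count):
        chunk = framed[i * FRAGMENT_SIZE:(i + 1) * FRAGMENT_SIZE]
        encrypted = _xor(chunk, _keystream(key, i, FRAGMENT_SIZE))
        for c in range(copies):
            # Hosts only ever see ciphertext; they never receive the key.
            hosts[(i + c) % len(hosts)][i] = encrypted
    return count


def retrieve(count: int, key: bytes, hosts: list) -> bytes:
    """Step 3: fetch each fragment from any host that has it and reassemble."""
    framed = b""
    for i in range(count):
        encrypted = next((h[i] for h in hosts if i in h), None)
        if encrypted is None:
            raise LookupError(f"fragment {i} is no longer hosted anywhere")
        framed += _xor(encrypted, _keystream(key, i, FRAGMENT_SIZE))
    length = int.from_bytes(framed[:8], "big")
    return framed[8:8 + length]


if __name__ == "__main__":
    hosts = [{}, {}, {}, {}]
    key = secrets.token_bytes(32)
    n = publish(b"evidence of injustice, recorded 2024", key, hosts)
    hosts[0].clear()                  # one host disappears
    print(retrieve(n, key, hosts))    # still recoverable from the replicas
```

Note that the hosts hold only ciphertext indexed by fragment number, which is what allows their operators to plausibly deny knowledge of the content.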

Let us now look at the second feature of user-hosted decentralized storage systems that is uniquely suited for human rights archiving: peer-to-peer networking. In peer-to-peer networks like Freenet, everyone can openly participate by contributing storage resources from their computer. The diversity of participants enhances the security and availability of the system, making it more robust and resilient for archiving purposes, because it is unlikely that independent entities will all be coerced, suffer a computer failure, or delete their copy of the archive at the same time. However, allowing everyone to participate can also introduce certain challenges and vulnerabilities. For instance, as individuals join and leave the network, files or parts of files can go offline, and the network must continually form new links and update its routing decisions. This churn makes the system harder to stabilize, increases its complexity, and degrades performance and reliability.

Moreover, open participation means that malicious entities can also enter the network. In practice, these malicious actors can observe traffic in the network and strategically position themselves to deanonymize targeted users, or even overload the network with dubious content and traffic, thereby compromising the system’s functionality and privacy properties. Another issue is that individual users may not want to store unknown data from other participants. Fear of legal liability for storing shards of, for instance, child sexual abuse material (CSAM), or ethical concerns about assisting political opponents or aggressors, may dissuade users from participating, whether or not anyone else can technically prove their involvement.

Recognizing these challenges, researchers and developers of decentralized storage systems have been exploring ways to address them. One direction involves isolating the network of legitimate users from spammers and malicious entities by leveraging social trust and reputation mechanisms. This can be achieved by establishing connections only with known contacts, as in friend-of-a-friend networks like Tribler and X-Vine, and by prioritizing content endorsed by users within their extended social circles, as seen in the Freenet Web of Trust. Beyond the security benefits, these social-based strategies also significantly improve the system's performance. Imagine social groups as close-knit circles of users who talk often, live nearby, and share similar interests. These circles minimize the need to involve computers from outside the group, instead emphasizing local-first connections, which in turn grant faster data access, increased reliability, and better assurances that relevant content is readily available to the users within the group. While there are many upsides to this feature, we must also consider that this approach reveals the users' social graphs and can slow down requests outside the social group.
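The friend-of-a-friend idea can be illustrated with a small graph search. The sketch below is our own simplification, not the actual protocol of Tribler, X-Vine, or the Freenet Web of Trust: it assumes each user keeps a set of known contacts and only accepts peers reachable within a fixed number of social hops (`max_hops`, a hypothetical parameter).

```python
from collections import deque


def trusted_peers(graph: dict, me: str, max_hops: int = 2) -> set:
    """Return every peer reachable from `me` within `max_hops` contact links."""
    seen = {me}
    frontier = deque([(me, 0)])
    while frontier:
        peer, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not extend trust beyond the hop limit
        for contact in graph.get(peer, set()):
            if contact not in seen:
                seen.add(contact)
                frontier.append((contact, hops + 1))
    return seen - {me}


if __name__ == "__main__":
    contacts = {
        "alice": {"bob"},
        "bob": {"alice", "carol"},
        "carol": {"bob", "dave"},
        "dave": {"carol"},
        "mallory": {"sybil1", "sybil2"},  # attacker cluster with no link to alice
    }
    print(sorted(trusted_peers(contacts, "alice")))
```

The attacker's cluster is excluded because no trusted contact vouches for it, and the same graph is exactly what the approach exposes: anyone who learns it also learns who knows whom.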

Delegated Decentralized Storage Networks

An alternative to user-hosted systems is one that divides participants into more distinct roles: those who use the network to download data and those who make that data available. Here, the responsibility of providing storage is delegated to a select group of computers, resulting in a more user-friendly experience, somewhat similar to cloud computing models. Decentralized storage technologies offer the flexibility to establish a scalable, self-organizing storage network and enable availability and privacy properties that are not possible with centralized cloud providers.

Prominent examples of delegated networks are IPFS and Hypercore, which provide a suite of decentralized protocols for addressing, routing, and transferring data that can be adapted to work for both users who store data and those who simply access it. With regard to what content gets stored in the network, both IPFS and Hypercore leave that to the discretion of the storage providers, who choose which files they will replicate locally. Files hosted on these networks can be retrieved from any of the computers storing them. On the user side, accessing content is possible either by running their own local nodes or via HTTP requests to Web gateways operated by others, making them accessible and user-friendly for a wide range of applications.

A core concern of hosting an archive on these "collectivized" decentralized storage networks is who is providing the infrastructure, meaning who are the entities that comprise the group of computers that store users' data. These concerns are more pronounced because of the smaller set of infrastructure providers. Who can join the group? What are their powers over the archive? What are the incentives that push them to contribute resources for other users? The answers to these questions ultimately determine important properties of the archive's resilience, privacy, and usability.

The default participation model of IPFS and Hypercore allows everyone to join the network and replicate the content they want. In other words, it is an open model similar to the user-hosted systems described above. Participating peers, in this case, are in a privileged position to observe who is reading or writing content in the decentralized storage. Also, there is no guarantee that this open group will preserve content in the long run. To ensure consistent user engagement and long-term preservation, delegated decentralized storage networks can also operate as invite-only collaborative clusters. For example, Āhau is a collective archive curated and hosted by Māori tribes to “capture, preserve and share cultural heritage, histories and narratives”. Tribal records are replicated among the personal devices of a few volunteer members, while others can access and curate them using their Web browser.

Since human rights archives are often created and maintained by at-risk, targeted groups, privacy and security are non-negotiable features. Given this, let’s consider how cryptographic techniques can be added to delegated decentralized storage networks to increase security. This can involve improving the networks’ own design (for instance, IPFS’s recent work on reader privacy) or integrating anonymizing protocols like Tor. To offer users content confidentiality, access control, and availability properties, someone developing this back-end for a human rights archive might apply object-capability models to ensure that only authorized entities can access the data. These models can be applied either to routing protocols (as in OCapN and Ceramic's CACAO) or to the data structures that encapsulate content, as in Secure Scuttlebutt's private messages. Moreover, delegated decentralized storage networks can employ threshold cryptography, breaking files into encrypted, overlapping parts so that the original file can be reconstructed even when some of the parts, up to a threshold, are lost. We find examples of this in Tahoe-LAFS and Storj, which build on threshold cryptography to improve resilience and strengthen availability, anticipating cases where a proportion of hosting providers are compromised or delete user files.
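To give a feel for reconstruction from a subset of parts, here is about the simplest possible scheme: a 2-of-3 split using XOR parity, where any two of the three shares are enough to rebuild the file. This is a deliberately reduced stand-in, assuming unencrypted shares and fixed parameters; production systems use more general k-of-n codes and encrypt the data before splitting it. The function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def _xor(left: bytes, right: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(left, right))


def encode(data: bytes) -> Tuple[int, List[bytes]]:
    """Split `data` into two halves plus one parity share (2-of-3)."""
    half = (len(data) + 1) // 2
    first = data[:half]
    second = data[half:].ljust(half, b"\x00")  # pad so both halves match
    return len(data), [first, second, _xor(first, second)]


def decode(length: int, shares: List[Optional[bytes]]) -> bytes:
    """Rebuild the original from any two shares; a lost share is passed as None."""
    if len(shares) != 3 or sum(s is None for s in shares) > 1:
        raise ValueError("need at least two of the three shares")
    first, second, parity = shares
    if first is None:
        first = _xor(second, parity)   # first = second XOR parity
    elif second is None:
        second = _xor(first, parity)   # second = first XOR parity
    return (first + second)[:length]


if __name__ == "__main__":
    length, shares = encode(b"witness statement")
    shares[0] = None                   # one provider deleted its share
    print(decode(length, shares))
```

The trade-off is storage overhead: here three shares occupy about 1.5 times the original size, in exchange for tolerating the loss of any single provider.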

Let us now, finally, turn to the blockchain.

Blockchain-Based Decentralized Storage Networks

Most of the decentralized storage systems discussed so far rely largely on volunteers to ensure continuity of the archive. Blockchain-based systems, by contrast, aim to ensure continuity by coupling the archive with an economic system, one that aims to bypass the power controls of government-issued currencies. The IPFS and Hypercore delegated decentralized storage networks mentioned in the previous section, for instance, serve as the base of the blockchain-incentivized storage networks Filecoin and DatDot, respectively. More specifically, Filecoin’s decentralized storage operates in the following way: IPFS provides the peer-to-peer protocol for syncing data and managing the distributed hash table (DHT), and Filecoin builds on IPFS by layering on a cryptocurrency-based incentive system used to pay for storage, among other things. Other popular projects in the same vein include Swarm, Arweave, and Codex.

Blockchains, initially introduced with Bitcoin, incorporate cryptographic, governance, and monetary policy protocols to create token economies over decentralized ledgers. Such tokens can also be exchanged for other currencies, and therefore become an attractive remuneration option for infrastructure providers. When it comes to decentralized storage systems specifically, the token economy builds an open market around renting storage resources. In practice, users store their files for a period of time by paying with blockchain utility tokens, which are fairly distributed among the entities that host the files. This is facilitated by the blockchain consensus protocol, which rewards providers who successfully store content and penalizes those who fail to do so.
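The reward-and-penalty logic can be sketched as a simple challenge-response game. Everything below is a hypothetical toy, not the proof system of Filecoin or any other network: before handing over the file, the client precomputes a few random challenges whose answers can only be produced by someone still holding the data, and each round either pays the provider or deducts from its collateral.

```python
import hashlib
import secrets


def proof(nonce: bytes, data: bytes) -> str:
    """Answer to a challenge; computing it requires the full data."""
    return hashlib.sha256(nonce + data).hexdigest()


class StorageDeal:
    """Toy storage contract: pay per passed challenge, penalize per failure."""

    def __init__(self, data: bytes, rounds: int, price: int, collateral: int):
        # The client precomputes answers so it need not keep the file itself.
        nonces = [secrets.token_bytes(16) for _ in range(rounds)]
        self.challenges = [(n, proof(n, data)) for n in nonces]
        self.price = price
        self.provider_balance = 0
        self.collateral = collateral

    def settle_round(self, respond) -> bool:
        """Issue one challenge; `respond` is the provider's answer function."""
        nonce, expected = self.challenges.pop()
        passed = respond(nonce) == expected
        if passed:
            self.provider_balance += self.price   # reward for storing
        else:
            self.collateral -= self.price         # penalty for failing
        return passed


if __name__ == "__main__":
    archive = b"archive bundle"
    deal = StorageDeal(archive, rounds=2, price=5, collateral=20)
    print(deal.settle_round(lambda n: proof(n, archive)))  # honest provider
    print(deal.settle_round(lambda n: proof(n, b"")))      # provider lost the file
    print(deal.provider_balance, deal.collateral)
```

In an actual blockchain network the bookkeeping would be enforced by the consensus protocol rather than by the client, which is what removes the need to trust any single party.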

One thing that we need to consider when using blockchain ledgers for human rights archives is that all storage operations are irreversibly recorded in the public blockchain history. Therefore, all publisher activity and archive metadata can be traced back in time. Despite the drawbacks mentioned below, this property can be beneficial when we care about the chronological sequence of such events: systems like OpenTimestamps and hashd0x, for example, rely on blockchain records to provide trusted timestamps, content integrity, and source authentication of uploaded files and their metadata. Notably, user identities in blockchains can be considered pseudonymous: instead of being issued centrally, identities are derived from cryptographic public keys and therefore do not include identifying information. Yet this also means that everyone can track a pseudonymous identity's activities and potentially infer its real owner. On top of that, cryptographic key management is complex for users, and we are eager to see how techniques employed in other blockchain systems (such as multi-sig and social recovery wallets, account abstraction, and zero-knowledge proofs for set membership) can ease adoption and improve privacy and security. These techniques require an additional feature for executing applications, called smart contracts, which is also finding its way into blockchain decentralized storage networks (see the Filecoin Virtual Machine) and opens new possibilities for decentralized applications that can operate programmably on the stored data without the need to trust any of the infrastructure providers.
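A hash-linked log gives a rough intuition for why recorded history is hard to rewrite and why it can serve as a timestamp. The sketch below is an assumption-laden miniature, not the format used by OpenTimestamps, hashd0x, or any blockchain: each entry stores only a file's digest, a caller-supplied time, and the hash of the previous entry, so altering an earlier record breaks every link after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(entry: dict) -> str:
    encoded = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def append(log: list, file_bytes: bytes, timestamp: int) -> dict:
    """Record a file's digest (never the file itself) after the latest entry."""
    entry = {
        "prev": entry_hash(log[-1]) if log else GENESIS,
        "digest": hashlib.sha256(file_bytes).hexdigest(),
        "time": timestamp,
    }
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Check that every entry still points at the hash of its predecessor."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        prev = entry_hash(entry)
    # Note: the newest entry is only protected once later entries build on it.
    return True


if __name__ == "__main__":
    log = []
    append(log, b"photo taken at the protest", 1707782400)
    append(log, b"follow-up interview audio", 1707868800)
    print(verify_log(log))
    log[0]["time"] = 1600000000   # try to backdate the first record
    print(verify_log(log))
```

The same property cuts both ways for archivists: it supports provenance claims, but it also means an entry made by mistake cannot be quietly removed.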

A significant limitation of blockchain-based decentralized storage options is the monetary cost a community would incur for hosting their archive there. Hosting data on a global network means competing with all other applications that bid their own price for the available storage space, which threatens the sustainability of the archive as prices may mount over time. Hopefully, blockchains will begin to unlock new directions in collaboration, reputation, and monetization models that archiving communities can adapt to their ways. One initiative that directly taps into this potential while specifically appealing to archivists is Alex., an archival platform built on Arweave that aims to preserve human history and culture. Alex. creates publicly funded decentralized archives where, in return for the tokens they contribute toward renting storage space in the blockchain network, funders receive a new type of token representing their contributions. This token can play a role in the curation and longevity of the archive by representing voting power, credibility, or other benefits to contributors and funders.

Take-aways

We conclude this reflection on the diverse landscape of decentralized storage technologies by highlighting some key takeaways. Firstly, decentralization fundamentally alters the dynamics of control and responsibility, aligning with participatory archiving processes essential for censorship resistance and privacy protection. However, decentralized storage isn't a one-size-fits-all solution, as various techniques address different aspects of archival stewardship, such as curation, dissemination, and authenticity. Moreover, the multitude of options available in decentralized storage necessitates careful consideration of trade-offs encompassing usability, costs, security, privacy, and resilience. Despite the progress made, a cohesive framework to serve Decentralized Archivist Communities remains elusive, highlighting the need for integration among disparate technological developments. While blockchain has garnered significant attention and investment for decentralization endeavors, its readiness to fully meet the demands of human rights archivists is still evolving. Although blockchain holds promise, particularly in ensuring integrity and authenticity, it requires further refinement to align with the specific needs of human rights archiving practices, drawing inspiration from existing systems while charting a path forward towards greater inclusivity and freedom.

Finally, human rights archivists must carefully assess which decentralized technologies to use, and in what combinations, in order to best safeguard or deploy their data. As these technologies are still nascent, human rights archivists are uniquely positioned to infuse values like transparency and accountability into them. Their expertise, advocacy, and active participation in the decentralized storage ecosystem can ensure that these tools are designed and implemented in a way that supports justice and makes preserving the truth and speaking it to power safer and more effective.

Through ongoing research, advocacy, and active participation in the decentralized storage ecosystem, OpenArchive works to identify challenges and opportunities with the goal of improving upon these systems to ensure they are beneficial to human rights archivists.

Benefits and Drawbacks of These Decentralized Storage Options

A Deeper Dive into Benefits and Drawbacks:

User-Hosted Decentralized Storage Networks

Benefits

  • Can hide user activity (anonymity) and remove liability for content they store or transfer (plausible deniability).
  • Data is hosted by a large and diverse user base.
  • Social trust and reputation mechanisms can marginalize adversaries and improve performance for legitimate users.
  • Provide tools for building decentralized applications, such as social media, blogging, whistleblowing, and source code repositories.

Drawbacks

  • All users need to host data and to be constantly connected to the network.
  • Computers joining and leaving the network make it slow and unreliable.
  • Adversaries can attack the network from the inside – e.g. by overloading it with spam data – and can observe user activity.
  • Users need to manage cryptographic keys.
  • Users may lack control over what data they store or support.


Delegated Decentralized Storage Networks

Benefits

  • Convenient for users as they don’t store content themselves.
  • Being self-organized, they offer flexible and scalable deployment.
  • Can be public or private.
  • Cryptographic techniques improve privacy and reliability.

Drawbacks

  • Small group of infrastructure providers gives them more power and liability over the archive compared to the other two options.
  • Unclear incentives for infrastructure providers.


Blockchain-based Decentralized Storage Networks

Benefits

In addition to the benefits of delegated systems, they also provide:

  • High availability, resilience and resistance to censorship.
  • Cryptocurrency tokens for incentivizing a large and diverse group of infrastructure providers.
  • Smart contract support for building decentralized applications.
  • New models for collaboration and financial sustainability.
  • A trusted timestamping service.

Drawbacks

  • Storage costs in cryptocurrency.
  • Users need to manage cryptographic keys.
  • Users have pseudonymous accounts whose activity is traceable in the public blockchain history.
  • Cannot delete actions from the blockchain history.