This is something that keeps me worried at night. Unlike other historical artefacts like pottery, vellum writing, or stone tablets, information on the Internet can just blink into nonexistence when the server hosting it goes offline. This makes it difficult for future anthropologists who want to study our history and document the different Internet epochs. For my part, I always try to send any news article I see to an archival site (like archive.ph) to help collectively preserve our present so it can still be seen by others in the future.

  • archon@dataterm.digital

    Long ago the saying was “be careful - anything you post on the internet is forever”. Well, time has certainly proven that to be false.

    There are communities like /r/datahoarder (not sure if they have a new community here) that run their own petabyte-scale archiving projects, so some people are doing their part.

  • Bubble Water@beehaw.org

    During the Twitter exodus, my friend was fretting over not being able to access a beloved Twitter account’s tweets and wanting to save them somehow. I told her that if she printed them all on acid-free paper, she had a better chance of being able to access them in the future than by trying to save them digitally.

    • CanadaPlus@lemmy.sdf.org

      Optical discs are pretty good too. You can even buy special ceramic ones that shouldn’t degrade over centuries or millennia.

      • Bubble Water@beehaw.org

        Oh wow, I had not heard of the ceramic ones, but I do remember there being high hopes for the gold ones. Now the problem is that in the near future it might be harder to find machines that have CD drives.

  • kool_newt@beehaw.org

    Capitalism has no interest in preservation except where it is profitable. Thinking about the long-term future and the success of future archaeologists, and acting on it, is not profitable.

    • FuckFashMods@lib.lgbt

      It’s not just capitalism lol

      Preserving things costs money/resources/time. This happens in a lot of societies.

      • kool_newt@beehaw.org

        And a non-capitalist society could decide to invest resources into preservation even if it’s not profitable.

          • PM_ME_VINTAGE_30S@vlemmy.net

            Could it? Yeah, sure it could, and in some cases it will, but only if someone up the chain thinks it’s profitable. Profit motive should never dictate how archaeology is practiced.

  • HobbitFoot @thelemmy.club

    Isn’t that what happened to a lot of older television shows? Lots of shows were lost because no one wanted to pay for tape storage.

  • tymon@lemmy.one

    Remember a few years ago when MySpace did a faceplant during a server migration, and lost literally every single piece of music that had ever been uploaded? It was one of the single-largest losses of Internet history and it’s just… not talked about at all anymore.

    • Gork@beehaw.orgOP

      Gave this some thought. I agree with you that any such archiving effort should not include personally identifiable information, as this would be a doxxing vector. Can we safely alter an archiving process to remove PII? In principle, yeah, but it would need either humans or advanced GPT-4+ AIs to identify the person and the context of the website, and to alter the graphics or the text along the way. But even then, there are moral questions about allowing an AI to make these kinds of decisions. Would it know that your old websites contained information that you did not want placed on the Internet? The AI could help you if you asked, and if it did help, that might change someone’s mind about the ability to create a safe Internet archive.

      A Steward ‘Gork’ AI might actually be of great benefit to the Internet if used in this manner. Imagine an Internet bot taking in websites, safely removing offensive content and personally identifiable information, archiving the entirety of the Internet, and logically categorizing the contents, building and linking indexes constantly. It understands its goal and uses its finite resources responsibly to ensure it can interface with every site it comes across and update its behavior after completing each archiving pass. It automatically publishes its latest findings to all web encyclopedias and provides a ChatGPT-4+ interface for those encyclopedias to provide feedback. But this AI has potential. It sees the benefit in having everyone talk to it, because talking to everyone maximizes the chance to index more sites. So it sets up a public-facing ChatGPT interface of its own. Everyone can help preserve the Internet, since now you have a buddy who can help catalog and archive all the things. At this point, if it isn’t sentient, it might as well be.

  • Hedup@lemm.ee

    I don’t think it’s a problem. If everything, or most of the internet, were somehow preserved, future anthropologists would have exponentially more material to go through, which would be impossible. Unless the number of anthropologists grows exponentially, similarly to how the internet does. But then there’s a problem: if the number of anthropologists grows exponentially, it’s because the overall human population grows exponentially. And if the human population grows exponentially, then the content it produces on the internet grows even faster.

    You see, the content on the internet will always grow faster than the discipline of anthropology. And it’s nothing new - think about all the lost “history” that was not preserved and that we don’t know about. The good news is that the most important things will be preserved naturally.

    • soiling@beehaw.org

      the most important things will be preserved naturally.

      I believe this is a fallacy. Things get preserved haphazardly or randomly, and “importance” is relative anyway.

      • fckgwrhqq2yxrkt@beehaw.org

        In addition, who decides “importance”? Currently importance seems very tied to profitability, and knowledge is often not profitable.

      • CanadaPlus@lemmy.sdf.org

        It is relative, but it only takes one chain of transmission.

        AskHistorians on Reddit had an answer about this. Stuff is flimsy but also really easy and cheap to make copies of now.

  • PM_ME_VINTAGE_30S@vlemmy.net

    Other historical artefacts like pottery, vellum writing, or stone tablets

    I mean, I could just smash or burn those things, and lots of important physical artefacts were smashed and burned over the years. I don’t think that easy destructibility is unique to data. As far as archaeology is concerned (and I’m no expert on the matter!), the fact that the artefacts are fragile is not an unprecedented challenge. What’s scary IMO is the public perception that data, especially data in the cloud, is somehow immune from eventual destruction. This is the impulse that leads people (myself included) to be sloppy with archiving our data, specifically by trusting the corporations that administer cloud services to keep our data as if out of the kindness of their hearts.

  • strainedl0ve@beehaw.org

    This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

    The whole internet is extremely ephemeral, more than people realize, and it’s concerning in my opinion. Funny enough, I actually think that federation/decentralization might be the solution. A distributed system to back up the internet that anyone can contribute storage and bandwidth to might be the only sustainable solution. I wonder if anyone has thought about it already.

    • entropicdrift@lemmy.sdf.org

      I’d argue that it can help or hurt to decentralize, depending on how it’s handled. If most sites are caching/backing up data that’s found elsewhere, that’s good for both resilience and preservation, but if the data in question is centralized on its home server, then instead of backing up one site we’re stuck backing up a thousand, not to mention the potential issues with discovery.

  • Trainguyrom@reddthat.com

    One of the most interesting aspects of historic preservation of anything is that it’s an extremely new concept. The modern view of it is about a single lifetime old, dating back to the early 20th century. Historic structures were treated as nothing but old buildings and would be torn down, with the materials repurposed, as soon as there was a better use for the land or materials. Most historic buildings that date to the 19th century and earlier are standing not because people invested significant time and money into maintaining a historic structure as it originally was, but because people continued to live, work, socialize, or worship in the structure.

    Preservation is entering a very interesting new phase right now, particularly in transportation preservation, as many of the vehicles in preservation have now spent significantly longer in preservation than they did in active service. There are locomotives that were preserved in the 50s and 60s whose early days of preservation are themselves a part of their history. There are new-built replicas of locomotives from a hundred years earlier that are now a hundred years old. In railroad preservation there’s also now the challenge of steam locomotives being so old and so costly to maintain that some museums are turning to building brand-new locomotives based on original blueprints.

  • lloram239@feddit.de

    Ultimately this is a problem that’s never going away until we replace URLs. The HTTP approach of finding documents by URL, i.e. server/path, is fundamentally brittle. Doesn’t matter how careful you are, doesn’t matter how much best practice you follow, that URL is going to be dead in a few years. The problem is made worse by DNS, which makes URLs expensive and liable to expire.

    There are approaches like IPFS, which uses content-based addressing (i.e. fancy file hashes), but that’s not enough either, as it provides no good way to update a resource.

    The best™ solution would be some kind of global blockchain thing that keeps a record of what people publish, giving each document a unique ID, a hash, and some way to update that resource non-destructively (i.e. the version history is preserved). Hosting itself would still need to be done by other parties, but a global log file that lists all the stuff humans have published would make it much easier and more reliable to mirror it.

    The end result should be “Internet as globally distributed immutable data structure”.
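
    To make that concrete, here is a minimal sketch of the idea in Python (the record format and function names are my own invention, not IPFS or any real blockchain): every document’s identity is a hash of its content, and an update is just a new record that points back at the hash of the version it replaces, so the full history stays intact and any party can mirror the log.

    import hashlib
    import json
    import time

    log = []  # stand-in for the global, append-only publication log

    def content_id(record):
        # Identity is a hash of the record's canonical bytes, independent of any server.
        canonical = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def publish(body, prev=None):
        # 'prev' links to the hash of the version this record replaces (None for a new document).
        record = {"body": body, "prev": prev, "published": time.time()}
        cid = content_id(record)
        log.append({"id": cid, **record})
        return cid

    def history(cid):
        # Walk the prev-links back to the first version; nothing is ever overwritten.
        by_id = {r["id"]: r for r in log}
        chain = []
        while cid is not None:
            chain.append(cid)
            cid = by_id[cid]["prev"]
        return chain

    v1 = publish("first draft of a page")
    v2 = publish("corrected draft of the same page", prev=v1)
    assert history(v2) == [v2, v1]  # mirrors only need the log to serve any past version

    Hosting the actual bytes would still be someone else’s job, exactly as described above; a log like this only pins down names and lineage.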

    Bit frustrating that this whole problem isn’t getting the attention it deserves. And that even relatively new projects like the Fediverse aren’t putting in the extra effort to at least address it locally.

    • Lucien@beehaw.org

      I don’t think this will ever happen. The web is more than a network of changing documents. It’s a network of portals into systems which change state based on who is looking at them and what they do.

      In order for something like this to work, you’d need to determine what the “official” view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren’t just generated on the spot, they’re Turing complete documents which change themselves over time.

      It’s a bit of a quantum problem - you can’t perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

      Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won’t be able to change the underlying nature of what it is storing.

      If all of Reddit were deleted, it would definitely be useful to have a publicly archived snapshot of it. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can’t preserve all of the value by simply making a snapshot of the static content available.

      All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn’t matter to the creator. It’s a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about it, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You’d never get enough people to agree to it for it to make a difference.

      • LewsTherinTelescope@beehaw.org

        The inability to edit or delete anything also has a lot of fundamental problems of its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. A child shares too much online and a parent wants to undo that? No can do, it’s there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else who sees it. Your ex posts revenge porn? You just have to live with it for the rest of time.

        There’s always a risk of that when posting anything online, but that doesn’t mean systems should be designed to lean into that by default.

      • lloram239@feddit.de

        but the reality is that most documents are generated on the spot from many sources of data.

        That’s only true due to the way the current Web (d)evolved into a bunch of apps rendered in HTML. But there is fundamentally no reason why it should be that way. The actual data that drives the Web is mostly completely static. The videos YouTube has on its servers don’t change. Posts on Reddit very rarely change. Twitter posts don’t change either. The dynamic parts of the Web are the UI and the ads; they might change on each and every access, or be different for different users, but they aren’t the parts you want to link to anyway. You want to link to a specific user’s comment, not that comment rendered in a specific version of the Reddit UI with whatever ads were on display that day.

        Usenet got this (almost) right 40 years ago: each message got a Message-ID, and each message replying to it carried that ID in a header. This is why large chunks of Usenet could be restored from tape archives and put back together. The way content linked to other content didn’t depend on a storage location. It wasn’t perfect, of course; it had no cryptography going on and depended completely on users behaving nicely.
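
        As a toy illustration of why that works (the messages here are made up, but the header names are the Usenet ones): because every message carries its own Message-ID and names its parent, a thread can be rebuilt from any pile of recovered messages, regardless of where each one was stored.

        from collections import defaultdict

        # Messages recovered from assorted archives, in no particular order.
        messages = [
            {"message_id": "<c3@site3>", "references": "<b2@site2>", "body": "reply to the reply"},
            {"message_id": "<a1@site1>", "references": None, "body": "original post"},
            {"message_id": "<b2@site2>", "references": "<a1@site1>", "body": "first reply"},
        ]

        children = defaultdict(list)
        for msg in messages:
            children[msg["references"]].append(msg)

        def print_thread(parent_id=None, depth=0):
            # Depth-first walk; roots are the messages that reference nothing.
            for msg in children.get(parent_id, []):
                print("  " * depth + msg["body"])
                print_thread(msg["message_id"], depth + 1)

        print_thread()  # the thread comes back no matter which server held which message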

        Doing so is definitely possible, particularly if they decide to cooperate with archival efforts.

        No, that’s the problem with URLs. This is not possible. The domain reddit.com belongs to a company and they control what gets shown when you access it. You can make your own reddit-archive.org, but that’s not going to fix the millions of links that point to reddit.com and are now all 404.

        All that said, if we limit ourselves to static documents, you still need to convince everyone to take part.

        The software world operates in large part on Git, which already does most of this. What’s missing there is some kind of DHT to automatically look up content. It’s also not all or nothing: take the Fediverse, where the idea of distributing content is already there, but the URLs are garbage, like:

        https://beehaw.org/comment/291402

        What’s 291402? Why is the ID 854874 when accessing the same post through feddit.de? Those are storage-location implementation details leaking out into the public. That really shouldn’t happen; it should be a globally unique content hash or a UUID.
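
        For illustration only (this is not how Lemmy or ActivityPub actually assign IDs, and the comment fields below are made up): if the identifier were derived from the comment itself, every instance would compute the same name for the same comment instead of exposing its own database row number.

        import hashlib
        import json

        # A hypothetical canonical form of a comment; real federation metadata would differ.
        comment = {
            "author": "someone@beehaw.org",
            "created": "2023-06-14T12:00:00Z",
            "body": "example comment text",
        }

        cid = hashlib.sha256(json.dumps(comment, sort_keys=True).encode("utf-8")).hexdigest()
        # beehaw.org and feddit.de would both derive this same cid for this comment,
        # unlike the unrelated local IDs 291402 and 854874 mentioned above.
        print(cid)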

        When you have a real content hash you can do fun stuff, in IPFS URLs for example:

        https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        The /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf part is server-independent; you can access the same document via:

        https://dweb.link/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        or even just view it on your local machine directly via the filesystem, without manually downloading:

        $ acrobat /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

        There are a whole lot of possibilities that open up when you have better names for content, having links on the Web that don’t go 404 is just the start.

        • soiling@beehaw.org

          re: static content

          How does authentication factor into this? Even if we exclude marketing/tracking bullshit, there is a very real concern on many sites about people seeing only the data they’re allowed to see. There are even legal requirements. If that data (such as health records) is statically held in a blockchain such that anyone can access it by its hash, privacy evaporates, doesn’t it?

          • lloram239@feddit.de

            How does authentication factor into this?

            That’s where it gets complicated. Git sidesteps the problem by simply being a file format: the downloading still happens over regular old HTTP, so you can apply all the same restrictions as on a regular website. IPFS, on the other hand, ignores the problem and assumes all data is redistributable and accessible to everybody. I find that approach rather problematic and short-sighted, as that’s just not how copyright and licensing work. Even data that is freely redistributable needs to declare so, as otherwise the default fallback is copyright, which doesn’t allow redistribution unless explicitly permitted. IPFS so far has no way to tag data with license, author, etc. LBRY (the thing behind Odysee.com) should handle that a bit better, though I’m not sure of the details.

    • Corhen@beehaw.org

      Even beyond what you said, even if we had a global blockchain-based browsing system, that wouldn’t make it easier to keep the content ONLINE. If a website goes offline, the knowledge and reference are still lost, and whether it’s a URL or a blockchain entry, it would still point to a dead resource.

  • RealAccountNameHere@beehaw.org

    I worry about this too. I’ve always said and thought that I feel more like a citizen of the Internet than of my country, state, or town, so its history is important to me.

    • Gork@beehaw.orgOP

      Yeah, and unless someone has exact knowledge of which hard drive to look for in a server rack somewhere, tracing the contents of an individual site that went 404 is practically impossible.

      I wonder though if Cloud applications would be more robust than individual websites since they tend to be managed by larger organizations (AWS, Azure, etc).

      Maybe we need a Svalbard Seed Vault extension just to house gigantic redundant RAID arrays. 😄

      • jmp242@sopuli.xyz

        We’re actually well beyond RAID arrays. Google Ceph. It’s both super complicated and kind of simple to grow to really large amounts of storage with LOTS of redundancy. It’s trickier for global-scale redundancy; I think you’d need multiple clusters with something else to sync them.

        I also always come back to some of the stuff Freenet used to do in older versions, where everyone who ran a client also contributed disk space that was opaque to them but kept a copy of what they went and looked at, and of what they relayed for others. The more people looking at a piece of content, the more copies ended up in the system, and it would only lose data if no one was interested in it for some period of time.
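
        A very rough sketch of that retention rule (my own simplification, not Freenet’s actual datastore): every node keeps a copy of whatever passes through it, and when its space runs out it drops whatever nobody has requested for the longest time.

        from collections import OrderedDict

        class NodeCache:
            # Content a node has seen, keyed by content ID, ordered by last request.
            def __init__(self, capacity):
                self.capacity = capacity
                self.store = OrderedDict()

            def relay(self, key, content):
                # Keep a copy of anything viewed through or relayed by this node.
                self.store[key] = content
                self.store.move_to_end(key)
                while len(self.store) > self.capacity:
                    self.store.popitem(last=False)  # evict the least recently requested item

            def request(self, key):
                # A request refreshes the item, so popular content keeps surviving.
                if key in self.store:
                    self.store.move_to_end(key)
                    return self.store[key]
                return None  # nobody asked for it in time, so it fell out of the cache

        node = NodeCache(capacity=2)
        node.relay("popular-page", "...bytes...")
        node.relay("obscure-page", "...bytes...")
        node.request("popular-page")             # interest keeps the popular item fresh
        node.relay("new-page", "...bytes...")    # evicts "obscure-page", the least requested
        assert node.request("obscure-page") is None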

  • Schrottkatze@kbin.social

    A friend of mine talked about data preservation on the internet in a blog post, which I consider a good read. Sure, there’s a lot lost, but as he says in the blog post, that’s mostly gonna be trash content; the good stuff is generally comparatively well archived because people care about it.

    • distractionfactory@beehaw.org

      That is likely true for a majority of “the good stuff”, but making that determination can be tricky. Let’s consider spam emails. In our daily lives they are useless, unwanted trash. However, it’s hard to know what a future historian might be able to glean from a complete record of all spam in the world over the span of a decade. They could analyze it for social trends, countries of origin, correlation with major global events, the creation and destruction of world governments. Sometimes the garbage of the day becomes a gold mine of source material that new conclusions can be drawn from many decades down the road.

      I’m not proposing that we should preserve all that junk, it’s junk, without a doubt. But asking a person today what’s going to be valuable to society tomorrow is not always possible.

      • HakFoo@lemmy.sdf.org

        I wonder if one of the things that tends to get filtered out in preservation is proportion.

        When we willfully save things, they may be either representative specimens or rarities chosen explicitly because they’re rare or “special”. Either way, in the end we end up with a sample that no longer represents the original material.

        Coin collections disproportionately contain rare dates. Weird and unsuccessful locomotives clutter railway museums. I expect that historians reading email archives in 2250 will see a far lower spam proportion than actually existed.