In mid-October 2019, Yahoo announced that it would “no longer host user created content” on Yahoo Groups as of December 14, 2019, a rather euphemistic description of its plans to delete countless message histories, files, polls, links, photos, general attachments and more (full list). The planned deletion of this history, created over the 18 years the service has been offered, was met with a lot of concern across its user base, the web archiving community, and the press. With less than two months’ warning, we have the unfortunately typical structure of an emergency web archiving situation: a web service closing down on short notice, with affected users spread across many fragmented communities and valuable content often only accessible via login. The Webrecorder project team wanted to explore how much effort it would take to produce a custom behavior for Webrecorder’s Autopilot system to capture content from Yahoo Groups, and what results could be expected from such a quick project.
Webrecorder’s automated collecting feature, Autopilot, now has a customized behavior for capturing public Yahoo Groups. Developing and testing this feature iteratively was a lot of work. The results were made possible by a steady dialog between the developers working on the behavior and non-technical staff doing many hours of in-depth testing.
We are pleased to announce that it is now possible to use Autopilot to obtain a ‘representative sample’ of content in public Yahoo Groups. It would not be accurate to say the capture is a full copy of the original. Webrecorder’s Autopilot essentially behaves like a very patient user, clicking links and buttons to access as many items as possible during a capturing session. The result is a ‘high fidelity’ capture, which maintains the whole layout (spatial relationship of items), branding, and overall look and feel of the site. Since there are many factors limiting the duration of a capturing session—the Webrecorder user terminating it, technical issues on the user’s end like a loss of internet connection, or rate-limiting by Yahoo—it is very unlikely that all of the contents stored in a group would be captured.
To get a complete export of a group’s contents, one can use Yahoo’s export feature (if you have the administrative rights to do so) or a Python script to scrape the data. Though all messages and file attachments would be present in such an export, they would be separated from the context in which users of Yahoo Groups created and encountered that data, since the web interface provided by the service would not be saved.
A ‘representative sample’ is a concept we are borrowing from the wider archival field. With large archives it’s usually not possible to “keep it all”, so a selected grouping is preserved to represent the larger entity (a sample representing the whole). Archives often provide evidence of something rather than serve as a one-to-one recreation of the original, so why not, when necessary, apply this expectation of representation to web archives as well?
Accessing a chosen subset of the content enables someone to get a sense of the Yahoo Groups interface and see the type of messages exchanged. In some Yahoo Groups, many members posted messages and read replies via their own email client, since Yahoo Groups could be set up to forward the activity of the group to email accounts. Given that the user experience was sometimes outside the Yahoo Groups interface, it is debatable whether staying true to the online environment is even essential. Yet on closer analysis of the messages in the Yahoo Groups, it’s clear that some folks used context-specific features, like particular emojis (or rather “smileys”, given their age), that would not be displayed when messages were accessed outside of Yahoo’s web services.
On to some examples! This capture of a public Yahoo Group, RIBIRDS, shows Webrecorder’s ability to collect messages (including images) from a public Yahoo Group that is on the open web and available without logging in to a Yahoo account. The collection contains materials going back to some point in 2018, though the group has been active since 2012.
Another example is the ‘archive-crawler’ group. A 2+ hour Autopilot run captures a significant portion of the available content (hundreds of the thousands of messages contained in the group). This is a good sample representing the experience of using Yahoo Groups and many of the messages exchanged by this group, which is focused on archival web crawlers. The captured messages can be browsed via the navigation bar on the left side or, in many cases, using the buttons in the Yahoo interface for moving from one message to the next.
With private groups, Autopilot’s success was inconsistent and the amount of content captured was not predictable, but the results could still be quite useful as a representative sample. Tests on private groups are not shared in this blog post, particularly since those groups were sometimes kept private for a reason, including safety concerns for those posting to the group.
Regarding private groups that require a login for access, it is also worth noting that your login credentials may be written into the web archive (WARC file) you create. If you plan to make this WARC file available to others, they might be able to analyze the contents of the file and find the credentials. While someone searching a WARC for login credentials is not a terribly likely scenario, if you are doing web archiving on behalf of someone else (e.g. your employer) it is advisable to use an account made specifically for web archiving activities rather than your own personal credentials, or to change your password after you conclude your web archiving work. If you keep your WARC files only for yourself, no one else would be able to access your credentials. The Webrecorder desktop app lets you use “preview” when starting a session: in that mode, you can interact with a website without any of the traffic being written to a WARC file. This makes it possible to log in to a site in preview mode, and only afterwards switch to capturing, which starts writing traffic to the WARC file.
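If you want to check a WARC file for your own credentials before sharing it, a rough scan along the following lines may help. This is a minimal sketch in Python using only the standard library, not a Webrecorder feature: the function names `credential_variants` and `warc_contains` are made up for this illustration, and the approach simply searches the (optionally gzip-compressed) file for the plain and URL-encoded forms of a string you supply. It assumes the credentials were stored unencoded or form-encoded; it would not find them inside compressed HTTP bodies or other encodings, so a negative result is not a guarantee.

```python
import gzip
from urllib.parse import quote_plus


def credential_variants(secret: str) -> list[bytes]:
    """Return the raw and URL-encoded byte forms of a secret (hypothetical helper)."""
    if not secret:
        raise ValueError("secret must not be empty")
    raw = secret.encode("utf-8")
    encoded = quote_plus(secret).encode("ascii")
    # Only keep the encoded form if it differs from the raw one.
    return [raw] if raw == encoded else [raw, encoded]


def warc_contains(path: str, secret: str, chunk_size: int = 1 << 20) -> bool:
    """Report whether the secret appears anywhere in a .warc or .warc.gz file.

    Assumption: a plain byte search is good enough for a first check.
    """
    needles = credential_variants(secret)
    # Keep enough trailing bytes so a match split across two chunks is still seen.
    overlap = max(len(n) for n in needles) - 1
    # gzip.open reads multi-member gzip files, which is how .warc.gz is usually written.
    opener = gzip.open if str(path).endswith(".gz") else open
    tail = b""
    with opener(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if any(needle in window for needle in needles):
                return True
            tail = window[-overlap:] if overlap > 0 else b""
```

For example, `warc_contains("my-group.warc.gz", "my password")` would return `True` if that password (or its form-encoded version) shows up in the file, where `my-group.warc.gz` stands in for whatever WARC you exported. If it does, logging in via preview mode and recapturing, or changing the password, are the safer options.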
Given that this Autopilot behavior is brand new and Yahoo itself is undergoing changes, it’s especially important to test your captures for quality. In other words, browse your collections and ask yourself ‘did I get what I think I got?’ Is the content they contain what you want or need? Is there anything missing that is so significant the archive would be misleading without it? In some cases you may be able to patch additional information into your collection, and time is of the essence to do so, since Yahoo will be deleting so much content in Yahoo Groups as of December 14, 2019.
Please let us know what you think! The best way to send specific feedback is via the error reporting feature in Webrecorder’s interface (available on all pages near the upper right corner of the screen) or by sending us a message at firstname.lastname@example.org.