Foreword by Midori

Ever-growing, uncontrolled, unreliable information is a long-standing problem in the enterprise. This article tells the story of how LinkedIn, the largest business-oriented social network, continuously measures and improves the quality of its information.

Technology environment: LinkedIn uses Confluence (by Atlassian) to manage its knowledge base. Its Confluence instance is enhanced with the Better Content Archiving app (by Midori).

Better Content Archiving offers several configuration options to implement the optimal information quality and archiving strategy for any organization, big or small. To find their optimal strategy, LinkedIn experimented with multiple approaches and communicated regularly with the app developers. When they finally succeeded, the team decided to share their story with the Confluence community to help other large-scale Confluence users solve similar problems.

The problem

LinkedIn's internal wiki is used by 16,000 employees and consultants. It contains 120,000 unique pages, and 55,000 of them have not been updated for 2 years or longer. We waited too long (10 years) to deploy an archiving strategy at LinkedIn, and the user frustration level was very high.

Here are some deployment tips and lessons learned based on our experience with Better Content Archiving 4.3.0 running in Confluence 5.7.4. We hope this information helps you be successful with a large-scale enterprise deployment.

Tip 1: Install the Better Content Archiving app as soon as possible

Don't put it off. The sooner employees are accustomed to automatic archiving, the better. As junk pages pile up over time, the wiki loses its credibility and usefulness.

Tip 2: First use a staging test instance for a dry run

Doing this gives you a preview of an archived wiki, identifies any issues or challenges, gives management a picture of what archiving looks like (which helps justify the cost), and helps you refine your deployment game plan.

In future app versions we plan to offer an actual dry run feature.
Although it will not completely replace the staging environment experiments, it will work like this: "show me what would happen if I ran the archiving job with the current settings - but do not actually archive anything!".
Dry runs will significantly reduce the efforts to find the best archiving settings for your content.

Aron Gombas, lead developer of Better Content Archiving for Confluence

Tip 3: Phase in archiving gradually

Start small. Don't surprise your users by archiving hundreds of pages at once.

We archived our wiki over several batches like this:

  1. Batch 1: 60 pages, small spaces 1 (a lightweight rollout)
  2. Batch 2: 100 pages, small spaces 2
  3. Batch 3: 1000 pages, medium spaces 1
  4. Batch 4: 1300 pages, medium spaces 2
  5. Batch 5: 1800 pages, medium spaces 3
  6. Batch 6: 1000 pages, medium spaces 4
  7. Batch 7: 2900 pages, large spaces 1
  8. Batch 8: 2900 pages, large spaces 2
  9. Batch 9: 2000 pages, large spaces 3
  10. Batch 10: 1400 pages, large spaces 4
  11. Batch 11: 7600 pages, jumbo space 1
  12. Batch 12: 32,000 pages, jumbo space 2
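
To see how gradually the rollout ramps up, the batch sizes above can be tallied into a running total. The following is a minimal Python sketch of that arithmetic; the names are ours for illustration and are not part of the app.

```python
from itertools import accumulate

# Batch sizes (pages archived per batch), copied from the list above.
BATCH_SIZES = [60, 100, 1000, 1300, 1800, 1000, 2900, 2900, 2000, 1400, 7600, 32000]


def cumulative_totals(sizes):
    """Return the running total of archived pages after each batch."""
    return list(accumulate(sizes))


totals = cumulative_totals(BATCH_SIZES)
for number, (size, total) in enumerate(zip(BATCH_SIZES, totals), start=1):
    print(f"Batch {number:2d}: {size:6,d} pages, {total:6,d} archived so far")
```

The first ten batches add up to 14,460 pages, and all twelve add up to 54,060 pages, which is in line with the roughly 55,000 stale pages mentioned earlier.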

Tip 4: Archive large wiki spaces in small batches

At LinkedIn, we had one Engineering wiki space containing 70,000 pages, 32,000 of which had to be archived. In a staging environment (with the mail server disabled), we attempted to expire and archive all the pages at once. Unfortunately, the job took several weeks to run and crashed the staging wiki several times. We changed our approach and decided to expire and archive pages using smaller date ranges. In the end, a 2,000-page archiving job took approximately 1.5 days to run. We used the following date ranges:

  1. Archive in 1-year batches:
    1. 2920+ days (8+ years), 308 pages
    2. 2555+ days (7+ years), 1500 pages
    3. 2190+ days (6+ years), 1520 pages
    4. 1825+ days (5+ years), 2468 pages
  2. Archive in 6-month batches:
    1. 1642+ days (4.5+ years), 2000 pages
    2. 1460+ days (4+ years), 2000 pages
    3. 1277+ days (3.5+ years), 3210 pages
  3. Archive in 3-month batches:
    1. 1186+ days (3.25+ years), 2500 pages
    2. 1095+ days (3+ years), 2500 pages
    3. 1003+ days (2.75+ years), 3162 pages
    4. 912+ days (2.50+ years), 3625 pages
    5. 821+ days (2.25+ years), 3655 pages
    6. 730+ days (2+ years), 2862 pages
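
The day thresholds in this list appear to follow a simple rule: the age in years multiplied by 365 and rounded down. Here is a minimal Python sketch of that calculation, assuming this rule holds; the function names are ours for illustration and are not part of the app.

```python
from math import floor

# Assumption: a year is counted as 365 days (leap days ignored),
# which reproduces day counts such as 2920 (8 years) and 1642 (4.5 years).
DAYS_PER_YEAR = 365


def age_threshold_days(years):
    """Convert a minimum page age in years to whole days, rounding down."""
    return floor(years * DAYS_PER_YEAR)


def plan_thresholds(oldest_years, newest_years, step_years):
    """List (years, days) thresholds from the oldest pages down to the newest.

    Use steps that are multiples of 0.25 so the float arithmetic stays exact.
    """
    thresholds = []
    years = oldest_years
    while years >= newest_years:
        thresholds.append((years, age_threshold_days(years)))
        years -= step_years
    return thresholds


yearly = plan_thresholds(8, 5, 1)             # the 1-year batches
half_yearly = plan_thresholds(4.5, 3.5, 0.5)  # the 6-month batches
for years, days in yearly + half_yearly:
    print(f"{days}+ days ({years}+ years)")
```

The 3-month batches can be planned the same way with a step of 0.25. Working from the oldest threshold down keeps each job limited to the pages that fall between two consecutive thresholds.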

Pre-5.1.0 app versions implement the following archiving strategy: copy the page to the archive space, then trash the original in its own space. (Note that this requires replicating the page content, its comments, its labels and even its attachments.) This strategy is resource intensive, although the heaviest parts are done in Confluence core, not in the app.
5.1.0 and newer versions offer an alternative, the "move" strategy. As the name suggests, instead of copying the data, it moves it, resulting in improved performance. (We will also keep the "copy and trash" strategy as an option, as it has its own merits.)

Aron Gombas, lead developer of Better Content Archiving for Confluence

Tip 5: Run the app during off hours

We measured a 30% spike in CPU usage during large archiving jobs (2,000 pages or more). To save CPU and memory resources, we decided to schedule the app to run when most employees are not using the wiki: Saturdays at 3:00 A.M.

The "move" strategy mentioned previously will decrease this load.
Nevertheless, running the job outside regular working hours is still a good idea, to avoid conflicts.

Aron Gombas, lead developer of Better Content Archiving for Confluence

Tip 6: Beware of email filters

Many employees filter the mail sent by Confluence, so they might not see the expiration emails sent by the app. If you are the person running the app, copy yourself as a Space Admin on all messages. You'll be able to reference a paper trail of communication, if needed.

You could alternatively select any user as "supervisor". It does not require changing the space admin permissions.

Aron Gombas, lead developer of Better Content Archiving for Confluence

Tip 7: Remove space watchers before they hate you

When the app moves pages into the archive, all space watchers are notified by email. Contact these users and make them aware of the issue. No space watcher wants to receive 10,000 emails in their inbox!

Email flooding should not occur in more recent app versions.
We invested major efforts in suppressing Confluence's built-in notifications.

Aron Gombas, lead developer of Better Content Archiving for Confluence

Tip 8: Give employees enough time to keep pages

Our strategy was to expire pages, wait for 30 days, and then archive them.

Tip 9: Brace yourself for pack rats

Some employees do not want their pages archived, regardless of the age or value of the content.

Tip 10: Prepare to fix or remove broken links everywhere

First do a dry run in a staging environment (not production). This approach enables you to preview all the broken links and pages that must be fixed after the archiving job has completed.

Tip 11: Create a FAQ support page

In your customized Velocity email template, include a link to an archiving support page with FAQs and contact information.

Tip 12: Empty the trash

After a space had been archived successfully, we waited for 2 months then emptied the space's trash. Keeping only archived pages (not trashed copies) is sufficient for page restoration.

Finally

Wishing you success,

LinkedIn's Engineering Technical Documentation Team
Greg McMillan, Andrea Dutra, Clyde Higaki, Deepa Jacob, Ursula Johnson