In this page
Techniques
Blacklist spaces
Reduce the frequency of jobs
Decrease data limits
Increase the application memory (heap size)
Other techniques
Tune the frequency of page view and page update journal writes
Overview
This page describes how to tune Better Content Archiving for improved performance, especially when it is used in massive Confluence instances. By "massive" we mean instances that contain 200 - 2,000 spaces and 500,000 - 2,000,000 pages.
If you use a smaller instance, you are unlikely to face performance problems. In that case, you can simply skip this page and leave every setting at its default value.
Please note that although we do our best to make sure that the Better Content Archiving app performs well under a wide variety of circumstances, there is no single configuration that is best for every environment and need. If you are having performance problems:
- Read this page and make those practical changes that are applicable to your situation.
- Ask for our help.
Techniques
Blacklist spaces
Nothing improves performance more than reducing the size of the problem to be solved. If you have spaces for which it makes no sense to track the content lifecycle, add those to the blacklist. This effectively makes them non-existent for the app.
Spaces that should typically be blacklisted are the ones that contain static (non-changing) information, like meeting notes, contracts or reports.
Reduce the frequency of jobs
If the background jobs overload your system because they are executed too frequently, consider executing them less frequently. If, for instance, re-calculating the content quality statistics every 4 hours generates too much work, why not do that only once a day, preferably outside working hours?
You can do this easily by configuring the schedule for these jobs:
- Better Content Archiving: Analyze Content Quality (re-calculates the content quality statistics)
- Better Content Archiving: Find and Archive Expired Content (starts the content lifecycle job)
This simple configuration change can significantly reduce the load on your instance, at the cost of some latency.
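The schedule of these jobs can be edited in Confluence administration under Scheduled Jobs, where the schedule is given as a cron expression. For example, assuming Quartz cron syntax, an expression like the following would run a job once a day at 2 AM, outside typical working hours:

```
0 0 2 * * ?
```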
Decrease data limits
Since version 6.0.0, the app automatically reduces the space and page data to an "actionable" size. Please note that the app still processes all pages in all non-blacklisted spaces; these so-called data limits are only applied at the end of the content lifecycle job (when saving the content audit log events or sending out the notification emails). Therefore, blacklisting spaces should be your primary tool to reduce the problem size, while data limits should be used in addition to it.
Data limiting aims to protect both your users from information overload and your system from wasting resources on useless work. If the data were not reduced and you had, for example, 100,000 expired pages in 500 spaces, your users would receive gigantic notification emails with extremely long page lists. Not only would those be unreadable, people would also ignore them completely due to the "where do I start fixing this?" problem.
To avoid this, data limits are applied in 3 phases (see the sketch after this list):
- Content audit log events: only the first N spaces and the first M pages per space are saved to the event stream.
- Notifications: only the first N spaces and the first M pages per space are used to find the addressees for the notification emails.
- Emails: only the first N spaces and the first M pages per space are listed in the notification emails.
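As a purely illustrative sketch (not the app's actual code; the `Space` record and `DataLimiter` class are made up for this example), applying such a limit boils down to truncating the collected data to the first N spaces and, within each space, the first M pages:

```java
import java.util.List;

// Illustrative only: a space with the titles of its expired pages.
record Space(String key, List<String> expiredPageTitles) {}

class DataLimiter {
    // Keep only the first maxSpaces spaces and, within each space,
    // only the first maxPagesPerSpace pages.
    static List<Space> limit(List<Space> spaces, int maxSpaces, int maxPagesPerSpace) {
        return spaces.stream()
                .limit(maxSpaces)
                .map(s -> new Space(
                        s.key(),
                        s.expiredPageTitles().stream().limit(maxPagesPerSpace).toList()))
                .toList();
    }
}
```

The same truncation idea is applied independently in each of the three phases, with the N and M values taken from the table below.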
Although the limits were hard-wired in app version 6.0.0, we made them configurable in 6.1.0. We think that the defaults fit the majority of Confluence instances, but you may eventually want to tune them to your specific needs (e.g. longer page lists in emails).
The table below lists the parameters that can be configured via standard Java system properties. For a detailed how-to, please see the Configuring system properties page in the official Confluence documentation.
System property | Default value | Description |
---|---|---|
carch.events.maxSpaces | 1000 | The maximum number of spaces for which content audit log events are saved after the execution of the content lifecycle job. (For spaces exceeding this limit, no event is saved, which puts an upper bound on the required storage capacity.) |
carch.events.maxPagesPerSpace | 200 | The maximum number of pages per space saved for each content audit log event. (The count of the pages exceeding this limit is displayed as "plus 123 more pages" when viewing events, but their details are not saved.) |
carch.notifications.maxSpaces | 1000 | The maximum number of spaces used to collect the notification email addressees. (For spaces exceeding this limit, no notifications are sent, which puts an upper bound on the generated SMTP server load.) |
carch.notifications.maxPagesPerSpace | 100000 | The maximum number of pages per space used to collect the notification email addressees. (The authors, last modifiers, etc. of the pages exceeding this limit are not notified.) |
carch.emails.maxSpaces | 200 | The maximum number of spaces listed in the notification emails. (The count of the spaces exceeding this limit is displayed as "plus 123 more spaces" at the bottom of the email, to keep the email relatively short.) |
carch.emails.maxPagesPerSpace | 100 | The maximum number of pages per space listed in the notification emails. (The count of the pages exceeding this limit is displayed as "plus 123 more pages in this space" in the email, to keep the page list relatively short.) |
Modifying these parameters requires a Confluence restart. They take effect at the next execution of the content lifecycle job; you will notice, for instance, longer page lists in the notification emails or in the content audit log events.
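For example, assuming a Linux installation, you could raise the email page list limit by adding a line like this to `<confluence-install>/bin/setenv.sh` (the path and the value of 500 are illustrative; see the Confluence documentation referenced above for other platforms and setups):

```sh
CATALINA_OPTS="-Dcarch.emails.maxPagesPerSpace=500 ${CATALINA_OPTS}"
```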
To verify that your custom values are picked up correctly, increase the logging level to DEBUG. You should then see log lines like this at the end of the content lifecycle job execution:
```
2017-01-24 10:41:33,708 DEBUG [Long running task: Content Archiving [CARCH]] [archiving.service.persistence.GlobalSettingsManager] getProperty System property <carch.emails.maxPagesPerSpace> is set to 500
```
Increase the application memory (heap size)
All memory-intensive operations of the app run in its background jobs, so it is useful to understand the basics of how memory is managed in them.
The transactional unit of the background jobs is a space. Therefore, the memory allocation pattern looks like this:
- While processing a space, the jobs gradually allocate more and more memory (proportionally to the number of pages, attachments and comments in that space).
- When the space is processed, the jobs release the memory that was allocated while processing it.
What does this mean?
- If you have spaces that are relatively "small" (0 - 30,000 pages), the jobs use transactions with a small memory footprint. Even if you have a large number of spaces of this size, the computation scales well, and you are unlikely to run into any memory issues with these.
- If you have one or more large (30,000 - 200,000 pages) or extremely large (200,000 - 500,000 pages) spaces, your Java Virtual Machine may eventually run out of memory, resulting in a classic Java OutOfMemoryError.
The solution is trivial: increase available memory following Atlassian's guide for Confluence.
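For example, assuming a Linux installation, the heap size is typically controlled by these variables in `<confluence-install>/bin/setenv.sh` (the values below are purely illustrative; size the heap according to Atlassian's recommendations and your instance):

```sh
JVM_MINIMUM_MEMORY="2048m"
JVM_MAXIMUM_MEMORY="8192m"
```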
An obvious alternative workaround is blacklisting the super-large spaces. We believe it is very unlikely that a 200,000-page space was created by human authors, and if it is generated content, does tracking its lifecycle make any sense?
Other techniques
This section describes techniques that tune other aspects of the app, not strictly performance.
Tune the frequency of page view and page update journal writes
Page view and page update tracking is implemented using the so-called deferred write technique.
On every page view or update, the app adds the event to a journal that collects those events in memory. This is a very fast operation, so the app does not cause any performance degradation, not even in super-busy Confluence instances.
Then, the app periodically writes the journal to the Confluence database, triggered either by a regular Confluence scheduled job or by the journal reaching a certain size between two writes. Writes happen transactionally: either all events in the journal are written to the database or none of them.
As a consequence, if Confluence is shut down or some technical problem occurs before the journal is written to the database, then the events in the journal are lost. This is very unlikely to cause any problem in practice, but it is useful to be aware of it.
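To make the pattern concrete, here is a minimal, purely illustrative Java sketch of the deferred write technique (this is not the app's actual code, and all names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the deferred write technique: events are buffered
// in memory and flushed to the database periodically, or as soon as the
// buffer reaches a size threshold.
class PageEventJournal {
    private final List<String> buffer = new ArrayList<>();
    private final int flushThreshold;

    PageEventJournal(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Called on every page view or update: fast, in-memory only.
    synchronized void record(String event) {
        buffer.add(event);
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    // Also invoked by a scheduled job (every 30 seconds by default).
    // In the real app the write is transactional: all events or none.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // writeToDatabase(buffer); // hypothetical persistence call
        buffer.clear();
    }
}
```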
If the last view information is super-critical for your organization, increase the frequency of the writes (30 seconds by default). You can do this easily by configuring the schedule for these two jobs (see the example after this list):
- Better Content Archiving: Persist the Content View Journal
- Better Content Archiving: Persist the Content Update Journal
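For example, assuming the schedule of these jobs is given as a Quartz cron expression, something like the following would persist the journals every 10 seconds instead of every 30:

```
0/10 * * * * ?
```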
Questions?
Ask us any time.