How to govern unstructured legacy content?

Your content is in a mess.

(If you read that and disagree, then this blog post probably isn’t for you.)

Your organisation has accumulated many years of files, emails, documents and miscellaneous content, stored across multiple platforms with deep and painfully inconsistent folder structures. Metadata is something you can only dream of – and automated records management is a distant objective that you know you need, but can’t imagine will ever be possible. You might have tried to work out how much content you have, or where your most valuable information is stored – and you probably found the results disheartening. I like to refer to this ever-growing mountain of unstructured legacy content as the ‘digital heap’ – something that you know you need to tackle, but don’t know where to start.

That was of course before you rushed into enabling Microsoft Teams. So now, in addition to the creaking file shares and bulging SharePoint sites, you have hundreds or perhaps thousands of Teams that have sprung up out of nowhere, leaving you unable to understand which part of the organisation they belong to, let alone the value or sensitivity of the information they contain.

If that sounds like you – don’t worry, you’re certainly not alone! Almost every organisation’s unstructured legacy content looks something like this – only a small number have managed to rein in their digital heap.

I regularly get asked how to tame this chaos, which is what I’m going to attempt to answer in this post.

How do I govern unstructured legacy content in my organisation?

1. Turn your back on it

You’ve probably already come up with several strategies for tackling the heap. Perhaps you should segment it? Perhaps you should identify areas that are safe to delete? You’ve wondered how much of the pain you can delegate to the wider business – then realised that they are going to focus on their day jobs and will never find time to tidy up the mess.

You might even have started implementing some of these approaches and have managed to take a few chunks out of the heap – only to find that, despite your efforts, it has somehow continued to grow.

I feel that the best initial thing you can do to tackle your unstructured legacy content is to turn your back on it. Instead of focusing your time and effort on the existing mess, I’d urge you to direct your attention towards the endless flow of new working areas that are being created every day. If you can apply governance to this ‘dripping tap’ of new SharePoint Sites and Teams, the digital heap will become a lot easier to handle.

2. Control the creation process

Now, let’s be clear, the constant drip-feed of new workspaces is something you don’t want to stop altogether. If you turn off the tap, staff will quickly find new ways of collaborating – very often ones that are completely outside of your control. Instead of merely stopping users from creating new Sites and Teams, what you want to do is control the creation process.

In Microsoft 365 it’s really easy for users to create new Sites and Teams. By default, any user can spin up as many new working areas as they want, give them any name they choose and invite any colleagues they wish to join as members. This results in a bit of a free-for-all – with hundreds or thousands of workspaces being created, without any easy way of determining the purpose, value, or even the owning department.

The first step is to prevent users from creating new workspaces by turning off self-service group creation. However, this cannot be done in isolation – as turning off the tap will quickly cause pressure to build as staff find alternative collaboration solutions. As such, when you disable self-service group creation, you will want to simultaneously introduce a new provisioning process.

The aims of this new provisioning solution will vary between organisations, but typically most want to ensure that their workspaces are being configured consistently. There are lots of different approaches you can take here, but I would suggest that you start by identifying your objectives. Typically, the primary objectives of provisioning include:

  • Supporting multiple ways of working – your provisioning process can include a series of different ‘templates’, each of which is optimised to support a different type of activity. These templates provide a consistent starting point for the structure of your projects, committees and departments (and other types of work that are common across your organisation).
  • Content classification – a provisioning process allows you to identify and apply appropriate default metadata to your libraries and folders. Each file will then be tagged ‘by-stealth’, simply based on where it has been saved. The idea here is to ensure that content is automatically tagged when it is first created, reducing effort for staff, while significantly improving the ability to find and manage information.
  • Context – controlling the provisioning process offers you a unique opportunity to compile information about the context of the Site/Team, making it far easier for you to appraise its value in the future. If you choose to, you can even present some of this context back to your users – perhaps through naming conventions, descriptions or even replacing the workspace’s default image – which can help your staff become more confident about the purpose, ownership and security of their workspaces.
  • Information Protection – you can easily weave sensitivity labels and even data loss prevention policies into your templates, so that content that requires a higher level of control can be protected automatically.
  • Retention – one of the core aims of controlling provisioning is to ensure that all content is automatically included within the scope of your records management strategy. By integrating retention labels and policies into your provisioning process you can make certain that all of your records are governed across new workspaces.
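
To make these objectives a little more concrete, here is a minimal sketch in Python of what a provisioning request and its template might capture. Everything in it is hypothetical: the template names, the naming convention, the metadata fields and the label names are assumptions for illustration, and it doesn’t call any Microsoft 365 API. It simply shows how a short request form, combined with a template, can decide the name, default metadata, sensitivity label and retention label before a workspace is created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceTemplate:
    """Hypothetical template: the defaults a workspace of this type receives."""
    prefix: str                       # naming-convention prefix (assumed)
    default_metadata: dict[str, str]  # applied to libraries, inherited by files
    sensitivity_label: str            # label names are illustrative only
    retention_label: str


@dataclass(frozen=True)
class WorkspaceRequest:
    """Hypothetical request form: the context captured at creation time."""
    title: str
    owning_department: str
    purpose: str
    template: str


# Illustrative templates; a real set would reflect your own ways of working.
TEMPLATES = {
    "project": WorkspaceTemplate(
        prefix="PRJ",
        default_metadata={"ContentType": "Project record"},
        sensitivity_label="Internal",
        retention_label="Project - 7 years",
    ),
    "committee": WorkspaceTemplate(
        prefix="CMT",
        default_metadata={"ContentType": "Committee paper"},
        sensitivity_label="Confidential",
        retention_label="Governance - permanent",
    ),
}


def plan_workspace(request: WorkspaceRequest) -> dict:
    """Validate a request and return the settings the new workspace should get."""
    if request.template not in TEMPLATES:
        raise ValueError(f"Unknown template: {request.template!r}")
    missing = [
        name
        for name in ("title", "owning_department", "purpose")
        if not getattr(request, name).strip()
    ]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")

    template = TEMPLATES[request.template]
    department = request.owning_department.strip()
    return {
        # The naming convention presents context back to users.
        "display_name": f"{template.prefix}-{department.upper()}-{request.title.strip()}",
        "description": request.purpose.strip(),
        # Default metadata tags files 'by stealth' based on where they are saved.
        "default_metadata": {**template.default_metadata, "Department": department},
        "sensitivity_label": template.sensitivity_label,
        "retention_label": template.retention_label,
    }
```

A request for a ‘project’ workspace titled ‘Office Move’ from Facilities would, under these assumed conventions, be planned as `PRJ-FACILITIES-Office Move`, with its department recorded as default metadata and its labels taken from the template.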

Very often you will encounter resistance if you try to introduce controls around the creation of Sites and Teams. I’ve often heard people argue that the introduction of a provisioning process will impose barriers that delay or even impede users. This is very much not the objective!

Instead, our aim should be to ensure that the process of creating a new Site or Team is as simple as possible and that it introduces benefits for the whole organisation. Sure, you’ll need to introduce a new form that captures information about the nature of the working area that is being requested, and, naturally, filling out this form will slightly slow down the creation process. However, instead of focusing on the negatives, make sure to extol the benefits that you can introduce: not just the improved governance, information protection and reduced duplication, but also how much easier it will be for staff to search for and find well-classified content. I strongly believe that applying governance through a provisioning process will lead to significantly improved organisational efficiency in the medium to long term.

3. Undertake a high-level audit of your legacy content

Don’t worry, I haven’t forgotten! Fixing the issue with the dripping tap doesn’t fix the issue with the digital heap – but it certainly helps!

Once you’ve fixed the tap, the digital heap stops growing. From this point forward, every chunk you take out of the heap genuinely reduces it.

The first thing I’d recommend after fixing the tap is to undertake a high-level audit of your digital heap. Try to identify the volume of data and the depth of your folders, and map this to the nature of the content and the part of the organisation that ‘owns’ each area. There are automated tools that can help with this, such as the SharePoint Migration Assessment Tool or, for file shares, DROID or TreeSize.
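
If you just want a rough first look at a file share before reaching for one of those tools, the sort of audit I have in mind can be sketched in a few lines of Python. This is a minimal sketch, not a replacement for a proper assessment tool: it assumes the share is mounted as an ordinary folder, and it assumes that each top-level folder roughly corresponds to an owning area of the organisation (which may well not be true of your heap). The function name is my own.

```python
import os
from collections import defaultdict


def audit_tree(root: str) -> dict[str, dict[str, int]]:
    """Summarise each top-level folder under root.

    Returns, per top-level folder, the number of files, the total size in
    bytes and the deepest folder level found (1 = the top-level folder itself).
    Files sitting directly in root are reported under "(root)".
    """
    root = os.path.abspath(root)
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "max_depth": 0})

    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        area = parts[0] if parts else "(root)"
        entry = summary[area]
        entry["max_depth"] = max(entry["max_depth"], len(parts))

        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Skip files we cannot read (permissions, broken links).
                continue
            entry["files"] += 1
            entry["bytes"] += size

    return dict(summary)
```

Sorting the result by `bytes` or `max_depth` gives you a quick list of the largest and most deeply nested areas, which is usually enough to decide where to look more closely.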

Some organisations decide to assemble a team tasked with working through the heap to assess or even migrate content into a different structure. This is a feasible, if costly, approach, which can prove effective if there is a pressing need or deadline involved. Personally, I find that while this process can make significant inroads, or even flatten your heap altogether, it is frequently too time-consuming for many organisations to countenance.

4. Use various tools

Another approach is to look to technology to help you chip away at your digital heap. For content in Microsoft 365, you can make use of various tools to apply context to your content at scale, including:

  • Trainable Classifiers – identify common types of content across your tenant and automatically apply retention and/or sensitivity labels to them. Trainable Classifiers allow you to take advantage of AI to automatically find consistent types of file. You train the classifier with at least 50 examples of the type of content and the AI will do the rest, automatically scanning areas of your tenant and tagging the files it identifies.
  • Sensitive Information Types – another method of scanning your Microsoft 365 content at scale is to use Sensitive Information Types. These allow you to find content containing specific codes or reference numbers. They are especially useful when looking for content that contains personal information such as driver’s licence or passport numbers. Once content containing the code/number has been found, you can automatically apply either retention or sensitivity labels to it, helping to improve the governance of content across your tenant.
  • Azure Information Protection unified labelling scanner – if areas of your digital heap are stored across file shares or on-premises SharePoint farms then the AIP scanner might be a useful tool to consider. The scanner allows you to automatically apply sensitivity labels by identifying content containing specified sensitive information types or regex patterns – perfect for extending the governance found in Microsoft 365 across your legacy data.
  • SharePoint Syntex – a great tool for scanning files and automatically extracting metadata through AI. Best used for more consistently structured content (such as invoices and purchase orders), SharePoint Syntex allows you to build models that scan content and apply labels to the files they identify. If you want to find out more about SharePoint Syntex, check out my colleague Leon’s blog.
  • Viva Topics – another workload that takes advantage of Microsoft’s AI capabilities, Viva Topics scans your content and identifies relationships in your existing data. The product automatically builds a knowledge network, using AI to identify key ‘topics’ – essentially it’s a bit like having an internal Wikipedia, built out of your existing content. While Topics certainly doesn’t replace a good information architecture, it presents an interesting option to automatically derive additional value from your legacy files.
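
The pattern-matching idea behind Sensitive Information Types and the scanner’s regex support can be illustrated with a short Python sketch. To be clear, this is not how the Microsoft tools are implemented or configured: the patterns, the reference-number formats and the label names below are all hypothetical, chosen only to show how matching a pattern can drive the choice of label.

```python
import re

# Hypothetical reference-number formats and label names, for illustration only.
PATTERNS = {
    "Staff ID": (re.compile(r"\bEMP-\d{6}\b"), "Confidential"),
    "Case reference": (re.compile(r"\bCASE/\d{4}/\d{5}\b"), "Restricted"),
}

# Labels ordered from least to most restrictive (an assumed scheme).
LABEL_RANK = ["General", "Confidential", "Restricted"]


def suggest_label(text: str) -> tuple[str, list[str]]:
    """Return the most restrictive label triggered by text, plus the pattern names that matched."""
    matched = [
        name for name, (pattern, _label) in PATTERNS.items() if pattern.search(text)
    ]
    labels = [PATTERNS[name][1] for name in matched]
    # With no matches, fall back to the least restrictive label.
    label = max(labels, key=LABEL_RANK.index, default="General")
    return label, matched
```

When a document matches more than one pattern, the sketch picks the most restrictive label, which is a design choice on my part rather than a rule taken from any product.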

There are plenty of other technical solutions you can lean on to help you tackle your digital heap, with a wealth of third-party products available that scan, assess and classify your content. However, you might need to combine several approaches, as each in isolation will only help resolve some of your legacy governance issues.

Finally, while ‘doing nothing’ to tackle your digital heap clearly isn’t a solution, I should point out that once you’ve fixed the dripping tap, your heap becomes easier to manage with each passing year. Frankly, as the information in the heap drifts from active to legacy, the process of making bulk decisions becomes much simpler. Now to be clear, I’m not suggesting that you can reach for the delete key and dispose of the entire heap – but it will become easier to identify areas of the heap that are of low value, or that haven’t been accessed in several years, and to use this information when making your decisions.
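
As a rough illustration of spotting those dormant areas on a file share, here is a minimal Python sketch. It rests on a few assumptions of mine: that the share is mounted as an ordinary folder, that top-level folders are a sensible unit for bulk decisions, and that ‘last modified’ is an acceptable stand-in for ‘last accessed’ (access times are often unreliable or disabled on file servers). The function name and the three-year default are hypothetical.

```python
import os
import time
from typing import Optional

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def stale_areas(root: str, years: float = 3.0, now: Optional[float] = None) -> list[str]:
    """Return top-level folders whose newest file was last modified more than `years` ago.

    Empty folders are not reported, because there is no file to date them by.
    """
    now = time.time() if now is None else now
    cutoff = now - years * SECONDS_PER_YEAR

    with os.scandir(root) as entries:
        folders = sorted(
            (e for e in entries if e.is_dir(follow_symlinks=False)),
            key=lambda e: e.name,
        )

    stale = []
    for folder in folders:
        newest = 0.0
        for dirpath, _dirnames, filenames in os.walk(folder.path):
            for name in filenames:
                try:
                    mtime = os.stat(os.path.join(dirpath, name)).st_mtime
                except OSError:
                    continue  # unreadable file: ignore rather than guess
                newest = max(newest, mtime)
        if newest and newest < cutoff:
            stale.append(folder.name)
    return stale
```

A list like this is a starting point for a conversation with the owning department, not a deletion list.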

If you want to have a chat about your own challenges with the digital heap, feel free to throw questions my way – I’m always happy to try to steer you in the right direction.
