What is the best way to migrate Terabytes of data into an ECMS?

Oct 28, 2019 5:46:33 PM

As part of our regular series on content management and cloud deployments, I wanted to give some practical advice on how to start migrating terabytes of data into a cloud-based enterprise content management system (ECMS).

At a high level, the steps are:

  • Scope
  • Analyze and Categorize – iterative
  • Phase the Migration
  • Discard and De-duplicate
  • Identify / Procure Target Systems
  • Migrate
  • Exceptions
  • Ongoing Maintenance and Training

Where to Start?

You know that you need to migrate all or most of your content into a managed CMS, but it’s hard to know where to start. Migration of existing content is often the most complex part of a project to deploy records or document management, and the process can seem overwhelming if you don’t have a clear plan. The following is a guide to that process and a breakdown of the steps you will need to undertake to successfully migrate your existing electronic content.

An analogy

I’ve always been a reader and book collector, and so has my partner. I grew up with books and when I lived in New York City I lived close to two of the largest bookstores in the city. Suffice it to say, I own a lot of books and so does my partner.

When we bought a house together, we moved everything and then faced the daunting task of unpacking and organizing it all. Between us, we had at least 100 boxes of books, maybe more, and for months they inhabited an entire room of the house, intimidating me with the scale of the task. Then, after an embarrassingly long period of time, I started the process of organization.

First, I unpacked each box and put the books on shelves as soon as they came out of the box. When I ran out of shelf space, I started to unpack them into piles on the floor of the guestroom until I could finally see the extent of the collection. Once I knew the scope, I built some new shelves to accommodate the overflow. Then I scanned through the entire collection to get some idea of categories. I divided (in my mind) paperbacks, hardbacks, and large format – mostly because each needed different types of shelving. And then for each of those, there were broad categories like literary fiction, travel, science, geology, art, architecture, technology, philosophy, etc.

Then I started to collect these categories together, and as I did I needed to refine my model – “science” was split into natural history, environmental, and physical sciences, “literary fiction” was split into classics and modern. As I grouped and collected the volumes together, I noted where I had duplicates and also where I had something that I no longer wanted to keep. 

It’s still not perfect and we still need some more shelving, but we are close to where we need to be and, for us, three thousand plus books make our home a great place to be.

Is my allegorical (and true) story applicable to your information management challenge and the challenge of migrating your existing content into a managed system? I think so – you have lots of files in many different places and formats, only some of them are valuable and still current, and you need to identify what to keep, where to keep them, and how to organize them.

Scope

First, you need to know the (broad) extent of your challenge. Most of your content volume is probably on shared file systems or legacy content or records management systems (if you are lucky), but it may also be in archives (PSTs and similar), obsolete or outdated applications or systems, file boxes or shelves of paper, local HDDs, etc. Try to get an idea of how much content you have – which will give you insight into the extent of the challenge and how much time and resources you need.

NOTE: The following may be helpful for translating paper to electronic file sizes:

  • A standard file box holds 2,000 pages.
  • 1GB of electronic material equals approximately 65,000 to 165,000 pages. Let’s take 100,000 pages as the average.
  • 50 storage boxes = ~1GB of electronic storage.
  • However, scanning / conversion is expensive: the industry average cost to scan records is between $0.08 and $0.15 per page. A standard box of records typically costs about $270 to scan. For small projects of a few boxes or fewer, the cost to scan a box of documents is about $400.
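
The rules of thumb above can be turned into a quick estimate. The following is a minimal sketch only: the function name is made up for illustration, and the constants are the approximate figures from the list, not measured values for your records.

```python
# Rough paper-to-electronic estimate using the rules of thumb listed above.
# All constants are approximations taken from the list, not measured values.
PAGES_PER_BOX = 2_000        # a standard file box
PAGES_PER_GB = 100_000       # midpoint used above (quoted range: 65,000 to 165,000)
COST_PER_PAGE_LOW = 0.08     # USD, low end of the quoted per-page scanning cost
COST_PER_PAGE_HIGH = 0.15    # USD, high end of the quoted per-page scanning cost


def estimate_paper(boxes: int) -> dict:
    """Return estimated pages, storage in GB, and scanning cost range for a box count."""
    pages = boxes * PAGES_PER_BOX
    return {
        "pages": pages,
        "storage_gb": pages / PAGES_PER_GB,
        "scan_cost_low_usd": round(pages * COST_PER_PAGE_LOW, 2),
        "scan_cost_high_usd": round(pages * COST_PER_PAGE_HIGH, 2),
    }


print(estimate_paper(50))
```

For 50 boxes this works out to about 100,000 pages, roughly 1GB of storage, and a scanning bill somewhere between $8,000 and $15,000, which is why it pays to decide what is worth scanning before you start.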

In the case of one of our recent clients, the engagement was driven by the fact that their 27TB storage allocation for departmental shared drives was full and they had no capacity to add extra disk. In addition, they wanted the retention management, improved security, and searchability that a structured storage system would provide. Once we got started with the project, we discovered another 40TB of content on other systems, so we had to update our plans and budgets.

Content that is already in a legacy or obsolete content or records management system may be easy to migrate or it may be a very complex challenge. You need to be able to extract the files, metadata, and security information from the source system and then convert them to a format you can use for importing into your new target system. Depending on your business and compliance requirements, you may also be required to migrate previous versions, audit histories, and other data (and each of these makes the challenge more complex). If you are lucky, your “old” system will support a standard interface like CMIS, but if you are unlucky you may have to use a series of service calls via a supported API or SDK. In the latter case, see if a migration expert (like Proventeq) already has this integration built, because it will almost always be quicker and easier to buy rather than build.

Analyze and Categorize

Our first step in the migration process is to understand what we have. There are many tools that allow you to analyze shared file systems, and your organization may already own one or more. We have had good experience with market-leading migration software providers like Proventeq, and TEAM IM has a tool that we built ourselves. The main difference between the two is footprint: the Proventeq tool has a full-featured user interface and supports many different integrations, but it does require installation on the client’s servers and a database. The TEAM IM tool is much more lightweight and is optimized for analysis of file shares on Windows systems and mounted file shares.

What you want is a searchable list of files and folders with the size and organization of each folder and the name, size, type, age, and owner of each of the files. Once you have this data it is up to you to examine it and make some observations – what are the types of your largest files? How recently was most of your content created? When was it last updated or accessed? Do you have an issue with files or folders being owned by deleted accounts?
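
If you don’t have a tool to hand, a first rough inventory can be scripted. The sketch below is not how the Proventeq or TEAM IM tools work; it is a minimal illustration using the Python standard library, and the function names are hypothetical. File owner lookup is platform-specific and has been left out, and last-accessed times are only as reliable as your file system’s settings allow.

```python
import os
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def build_inventory(root: str) -> list[dict]:
    """Walk a file share and record path, size, type, and timestamps for each file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # unreadable or vanished file: skip it rather than abort
            records.append({
                "path": str(path),
                "size_bytes": st.st_size,
                "type": path.suffix.lower() or "(none)",
                "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
                "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
            })
    return records


def bytes_by_type(records: list[dict]) -> list[tuple[str, int]]:
    """Total bytes per file extension, largest first."""
    totals = Counter()
    for rec in records:
        totals[rec["type"]] += rec["size_bytes"]
    return totals.most_common()
```

Calling `bytes_by_type(build_inventory(path_to_share))` answers the first question above (which file types take up the most space), and the same records can be sorted by `modified` or `accessed` to see how much of the content is stale.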

You will need to work with the content owners, record managers, and other stakeholders to create a plan as to which material to keep and what to discard, how to organize what is being retained, and where to put it. Then it’s time to put the plan into action.

TIP – if your storage array has snapshotting or similar functionality you will want to exclude those folders from analysis.
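
As a minimal sketch of that tip, the walk below skips snapshot folders entirely. The directory names are assumptions for illustration (different storage products use different names), so check what your array actually exposes before relying on the list.

```python
import os

# Hypothetical snapshot directory names; confirm what your storage array uses.
SNAPSHOT_DIR_NAMES = {".snapshot", "~snapshot", ".zfs"}


def walk_without_snapshots(root: str, excluded=SNAPSHOT_DIR_NAMES):
    """Yield file paths under root, skipping any directory whose name is excluded."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into those folders.
        dirnames[:] = [d for d in dirnames if d.lower() not in excluded]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning during the walk, rather than filtering the results afterwards, matters here because a snapshot folder can contain many full copies of the share and would otherwise multiply the time the analysis takes.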

Phase the Migration

If you have a small organization or a small collection of files you can do all of this in one go – but if that were the case, you would probably have done it already. In a larger organization or one with a lot of content, you will want to break the process down into phases and it’s up to you to decide what those phases should be. 

Phases that have worked well for us in the past include:

  • Starting with the largest files and moving all of them to different storage
  • Starting with one or two departments and learning from the challenges that came up in each
  • Following a geographic pattern – moving files belonging to groups in one location first and being onsite to support them

In the end, though, you’ll have to come up with a plan that works best for you and assess how successful it is as you work through the phases.
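
When weighing up candidate phases, it helps to see how the volume splits across groups. The sketch below assumes, purely for illustration, that each department has its own top-level folder on the share and that you already have a list of (path, size) pairs from your analysis; the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def size_by_top_level_folder(files, root):
    """Sum bytes per first-level folder under root, largest first.

    files is an iterable of (path, size_in_bytes) pairs; root is the share's base path.
    """
    totals = defaultdict(int)
    for path, size in files:
        relative = PurePath(path).relative_to(root)
        # Files sitting directly in root are grouped under "(root)".
        top = relative.parts[0] if len(relative.parts) > 1 else "(root)"
        totals[top] += size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The output is a simple ranked list of folders by size, which you can use to pick one or two small departments as a pilot or to spot the group whose content dominates the total.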

Discard and De-duplicate

There are arguments for migrating everything and then cleaning up in the new system, and equally valid arguments for cleaning up before the move. In most cases, it’s good to do both. Your initial analysis will have identified some content and files that have no value to you, so you can and should discard them before migration. One caveat: if you are in a regulated environment of any kind, you may need proof of which files you discarded, when, and why, so be sure to document your procedures and criteria for disposal. If in doubt, import them into the new system and then immediately expire them, which will generate all the audit trail you need.
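
For the de-duplication half of this step, one common approach (an assumption here, not a description of any particular product) is to group files by size and then by a content hash. The sketch below only reports duplicate groups; it deliberately deletes nothing, so the decision and the audit trail stay with you. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group files with identical content; sizes are compared first to avoid needless hashing."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes before hashing keeps the run time manageable on a large share, since most files have a size that no other file shares and never need to be read at all.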

Identify / Procure Target Systems

This can (and will) be a white paper or blog post all on its own – but you need to have a better place to store your content than a shared file system. The key to selecting a new ECMS or replacing an existing one which no longer meets your needs is a clear statement and documentation of your organization’s needs. Many of the formerly prominent vendors have been acquired or have expanded their offerings to cover marketing and customer data platforms (with a commensurate increase in complexity and price), and there are a host of new players that have arisen to round out the sector. In this complex landscape, it can be a challenge to find vendors and products that match your requirements, budget, and existing technological footprint.

Fortunately, unstructured content is TEAM IM’s sole focus and we have been managing it for over twenty years. To help our customers make sense of the complex web of offerings, capabilities, and platforms, we offer a content advisory service that can help you through the requirements gathering, selection, and procurement processes (and beyond into strategy). We deploy more content and records management systems in a year than most people will in a career, so we are well-placed to make sure you invest your time and resources wisely and effectively.

Migrate

Once your new target system(s) is/are in place you will want to test your migration process on a subset of content and in a non-production environment. Make sure you have backups and plans for data preservation – this content is important and valuable, otherwise you wouldn’t be migrating and preserving it. Even within your phases, you may find that the migration process is iterative – you may migrate all large files first, then infrequently accessed files, and only then the remainder in the last iteration.

It’s generally a good idea to keep the old system in a read-only mode for a week or two just to be sure the migration has gone as planned.

Each of your new systems will have tools for migration and you’ll need to look at each and be careful of any limitations. For instance, at the time of writing the maximum file size for OneDrive for Business or SharePoint Online (which share a storage architecture) is 15GB – if you have larger files than that you will need to find an alternative place to house them. Other systems may have file type exceptions. If you have content that is “owned” by a nonexistent account, you’ll need to make a decision as to which account will own the files in the new system.

Migration can take a long time, so make sure you test large data sets before impacting day-to-day operations. We have seen uploads of 1TB / day from local disk to SharePoint Online, but that bandwidth usage will almost certainly impact normal traffic during work hours. For other clients with limited connectivity we have had to slowly upload content from local servers to a new data center and in some cases ship (encrypted) hard drives to speed up data transport. Plan for this and set reasonable expectations with management and users.
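
To set those expectations, a back-of-the-envelope calculation is usually enough. The sketch below assumes the throughput figure is for a full 24-hour day and scales it down if uploads are restricted to off-peak hours; the function name and parameters are hypothetical, and real throughput should come from your own tests.

```python
import math


def transfer_days(total_tb: float, tb_per_day: float, usable_hours_per_day: float = 24.0) -> int:
    """Whole days needed to move total_tb when uploads run only part of each day.

    tb_per_day is the throughput observed over a full 24-hour day.
    """
    if total_tb < 0 or tb_per_day <= 0 or not 0 < usable_hours_per_day <= 24:
        raise ValueError("inputs must be positive and hours must be in (0, 24]")
    effective_tb_per_day = tb_per_day * usable_hours_per_day / 24.0
    return math.ceil(total_tb / effective_tb_per_day)
```

At 1TB / day, a 27TB share needs 27 days of continuous uploading (`transfer_days(27, 1)`), and 54 days if uploads are limited to 12 hours outside the working day (`transfer_days(27, 1, 12)`). Numbers like these make it much easier to justify phasing or shipping drives.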

Exceptions

Some exceptions were listed above – for instance, cloud systems often have a file size limit (at the time of writing, 15GB for SharePoint Online, 5GB for Box, and 50GB for Dropbox). Most on-premises systems have much smaller initial file size limits, but these can usually be configured to be larger. Most content management and file storage systems have file type restrictions as well – usually to protect the system from the upload of executable code.

If you have unusual file types or need to store scripts or executable code, make sure this is allowed.  You may have to compress the files as zip or tar archives to get around these limits.
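
It is worth flagging these technical exceptions before the migration run rather than discovering them as failed uploads. The sketch below is illustrative only: the size limit and the blocked extension list are placeholder values, so confirm the real ones against your target system’s current documentation. It expects (path, size) pairs such as those produced by your earlier analysis.

```python
from pathlib import Path

GB = 1024 ** 3

# Illustrative values only; confirm current limits with the target system's documentation.
MAX_FILE_BYTES = 15 * GB
BLOCKED_EXTENSIONS = {".exe", ".bat", ".ps1"}  # hypothetical blocked list


def find_exceptions(files, max_bytes=MAX_FILE_BYTES, blocked=BLOCKED_EXTENSIONS):
    """Return (path, reason) pairs for files the target system would likely reject.

    files is an iterable of (path, size_in_bytes) pairs.
    """
    problems = []
    for path, size in files:
        if size > max_bytes:
            problems.append((path, "too large"))
        if Path(path).suffix.lower() in blocked:
            problems.append((path, "blocked file type"))
    return problems
```

The resulting list gives you something concrete to take to content owners: each entry either needs an alternative home, needs to be archived as a zip or tar file, or can be discarded.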

Business exceptions – you may have a department or group with different security or functional requirements, and in this case, you need to manage these within your chosen platform or find them a different place to store their content.  It’s always important to separate wants from needs in these situations and to determine if they have a budget to provide for their exceptions.

Ongoing Maintenance and Training

Enterprise applications should be maintained like a large bridge – once you have finished painting one end, you need to go right back to the beginning to start again. 

You can’t hope to get all of your requirements perfect in the first release and as your people start to use the system their needs and preferences will evolve and change. So, it’s important to iterate and refine your configuration and capabilities to match and reflect the evolving usage of the system. At the same time, your systems contain vital and important organizational data, so it is crucial to stay on top of patching and deployment of upgrades as they are released.

One of the major attractions of cloud systems (whether SaaS, IaaS or PaaS) is that you are buying support and maintenance along with the capabilities – but you must be very clear who “owns” which layers in the stack. All of them will require maintenance, optimization, patching, and updating and responsibilities for each must be very clear at the beginning of the project and ongoing.

The same is true for training and support – most organizations will have staff turnover and it’s important that there are processes in place to train new users on best practices for using the various systems. The goal here is to keep the new systems in use and in good order – to prevent you from having to do this all over again in another 5-10 years.

Good luck!  And if you have any feedback or need help with planning or executing your migration effort, get in touch. TEAM IM Informatics has been doing this for twenty years across dozens of platforms and versions. We are the content management experts and we would be happy to talk to you and provide guidance and assistance.
