Content Migration using Talend
Redevelopment of a website is usually triggered by one of three major factors:
- The current website is built on a technology stack that is now obsolete
- The existing website needs a redesign or revamp, either to address weaknesses in the current system or to add significant features
- The organization is switching to a new technology platform, such as a new Content Management System (say, AEM)
Often these factors are coupled together, for example selecting a new technology platform while also redesigning the existing site. Identifying the new technology platform is a difficult task and depends on various factors such as budget, feasibility, stability of the new technology stack, maintenance support, time to market, etc. No matter what that choice is, more often than not it gives birth to a migration project. An organization that has thousands of pages, articles, assets, etc. would want to retain that data rather than create everything from scratch. Migration has a very wide scope, but this blog post focuses on Content Migration.
Content Migration is the process of moving an organization's existing digital media to the new system. This is certainly not a simple process.
A change in technology platforms makes the migration challenging, as does a major restructure or redesign of the site.
Content Migration can be achieved in either of the following two ways, or sometimes a combination of both:
- Manual: Ctrl+C and Ctrl+V are the favorite keyboard shortcuts of every developer. The manual way is always the easiest, yet the most painful one. If it is about a few pages, you might want to copy the content from the old site and paste it into the new publishing tool. But if the old system contains thousands of pages, would you want to follow that route? Maybe you can hire a team of content authors who will do the job for you. But a manual process is error-prone.
- Automated: Automating the entire migration process is clearly an appealing option. You use a tool or methodology in which you define the rules for the migration process, so it requires little or no manual effort. Talend Open Studio (an ETL tool) is one such tool that can be used to automate the content migration process.
There are three basic requirements for migration:
- The input export of the existing content. It can be in any form, e.g. a delimited text file or an XML file, depending on the existing system.
- The output format, i.e. what should be the end result of the migration process? Which data from the existing system should map to the new system (AEM in our case)? You should be clear about all the mapping and transformation rules specific to the new system. As we are dealing with migration to AEM, we need to define the mappings between the existing content and AEM components. For instance, if the input extract is an XML file, you would have to define the mappings between the XML tags and the properties of an AEM component.
- The loading mechanism, which defines how the content gets loaded into the target system. This is a very important part, as the whole migration process is designed around the method of load. We’ve chosen the approach of creating a valid CQ package that can be installed from the CRX Package Manager. One of the major advantages of this approach is that we can easily roll back by uninstalling the package.
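To make the mapping requirement concrete, here is a minimal Python sketch (not the Talend job itself) that maps the tags of one input XML record to properties on a page's jcr:content node and renders a .content.xml document. The input tags (article, title, summary) and the FIELD_MAP table are hypothetical; a real migration would use whatever the source export contains. Also note that this sketch does no special handling for values that begin with `{` or `[`, which as far as I know the package serialization treats specially.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical mapping rules: source XML tag -> property on the jcr:content node.
FIELD_MAP = {
    "title": "jcr:title",
    "summary": "jcr:description",
}


def to_content_xml(source_xml: str) -> str:
    """Render one source record as the text of an AEM page .content.xml."""
    record = ET.fromstring(source_xml)

    # Start with the node type, then add one property per mapped source tag.
    attrs = {"jcr:primaryType": "cq:PageContent"}
    for tag, jcr_name in FIELD_MAP.items():
        value = record.findtext(tag)
        if value is not None:  # skip tags that are absent from this record
            attrs[jcr_name] = value.strip()

    # quoteattr() escapes &, <, > and quotes so the attribute stays valid XML.
    attr_lines = "\n".join(
        f"        {name}={quoteattr(value)}" for name, value in attrs.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0"\n'
        '    xmlns:cq="http://www.day.com/jcr/cq/1.0"\n'
        '    jcr:primaryType="cq:Page">\n'
        f'    <jcr:content\n{attr_lines}/>\n'
        '</jcr:root>\n'
    )


if __name__ == "__main__":
    sample = "<article><title>Hello AEM</title><summary>First page</summary></article>"
    print(to_content_xml(sample))
```

Keeping the rules in a plain table like FIELD_MAP means a new mapping is a one-line change, which is roughly what you get from a mapping component in an ETL tool.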
A basic migration job created using Talend looks as follows:
Each block in the above picture is a component (tRunJob in this case) that calls another sub-job. The connectors between two such blocks define the transition, i.e. how and when we want the next block to be executed. In this case, these transitions are called triggers.
This main job consists of four sub-jobs. The purpose of each sub-job is explained below:
- Pre-migration Cleanup: This job reads the input content (say, XML) and breaks it into smaller, manageable chunks (multiple XML files) which can be worked upon individually. The job can be modified to handle scenarios like internal URL mapping, resolving character encoding issues, defining tag mapping rules, etc.
- Extraction & Transformation: This job reads the XML files created in the previous step one by one, transforms each into the AEM-specific .content.xml schema, and stores it under the required jcr_root hierarchy on the file system.
- Post-migration Cleanup: This job is only required if any cleanup needs to be done after the migration.
- Packaging: This is the final step of the migration, which creates the archive of the pages migrated in the above steps. Keep in mind that the package needs to be AEM compatible, i.e. it should contain the jcr_root and META-INF folders and the associated metadata properties as per the AEM packaging standard.
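The packaging step can be sketched outside Talend as well. The Python function below is only an illustration, assuming the transformation step has already written the pages under a local jcr_root folder. It zips that folder together with a META-INF/vault/filter.xml and properties.xml. The function name, the package name and group, and the filter root are all hypothetical values chosen for the example.

```python
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def build_package(jcr_root: Path, zip_path: Path, filter_root: str,
                  name: str, group: str = "migration", version: str = "1.0") -> None:
    """Zip a local jcr_root tree plus minimal package metadata."""
    # filter.xml tells the package manager which repository path the package covers.
    filter_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<workspaceFilter version="1.0">\n'
        f'    <filter root={quoteattr(filter_root)}/>\n'
        '</workspaceFilter>\n'
    )
    # properties.xml carries the package name, group and version.
    properties_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">\n'
        '<properties>\n'
        f'<entry key="name">{escape(name)}</entry>\n'
        f'<entry key="group">{escape(group)}</entry>\n'
        f'<entry key="version">{escape(version)}</entry>\n'
        '</properties>\n'
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("META-INF/vault/filter.xml", filter_xml)
        pkg.writestr("META-INF/vault/properties.xml", properties_xml)
        # Copy every generated file, keeping its path relative to jcr_root.
        for path in sorted(jcr_root.rglob("*")):
            if path.is_file():
                arcname = "jcr_root/" + path.relative_to(jcr_root).as_posix()
                pkg.write(path, arcname)
```

A call such as build_package(Path("out/jcr_root"), Path("out/migrated-content.zip"), "/content/mysite/migrated", "migrated-content") would then produce an archive to try in the package manager. Treat the metadata here as a bare minimum. A real package may need more properties than this sketch writes.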
You can download this e-book for more details.
Hope this helps!
TechAspect has a more sophisticated tool than this.