January 27, 2015

Using Talend Big Data to Move Data from MongoDB to PostgreSQL

At CCC we use Polyglot Persistence in many of our products.  This inevitably leads to the need for data integration.  A tool we’ve found useful is Talend Open Studio for Big Data.  This “tips-and-tricks” post discusses using Talend to migrate data from a MongoDB database to a PostgreSQL relational database for reporting and similar use cases.

Talend Open Studio for Big Data comes prepackaged with connectors and components for a large number of open source and proprietary “big data” technologies, including Cassandra, HBase, Neo4J, and MongoDB, which is our source NoSQL database for this article.

Note that the version of Talend Open Studio for Big Data used in this post is 5.5.1.

Components

For this job, we’ll need only four Talend components:

  • tMongoDBConnection to connect to our source MongoDB database

  • tMongoDBInput to extract a particular set of documents

  • tExtractJSONFields to transform the MongoDB document into columns

  • tPostgreSQLOutput to store the resulting data in a PostgreSQL target database

The key component here is the tExtractJSONFields which transforms the JSON document returned by a MongoDB query into individual fields.  This component is actually based on the concept of XPath for XML and pulls out individual pieces of a JSON document.

Process

Begin by dragging the tMongoDBConnection component on the the Designer.  Configure the connection information for the MongoDB database.  This will look something like the following:

image

Now drag a tMongoDBInput component to the Designer.  Join it to the tMongoDBConnection via an “OnSubjobOK” trigger.

image

Now configure the tMongoDBInput component.

image

It’s important to note a couple of things about the configuration of the tMongoDBInput component.

  • The query string is simply what would normally be inside the db.find() operator, if you typed a query from the mongo shell program.  In our example, we’re returning an entire document, without any filtering on the type or characteristics of a document.  
  • You must define the schema of the JSON document, by clicking the “Edit schema” button.  Here you can define all the keys from your MongoDB document.  Ours is quite simple, consisting of only three keys:

image

But there is hidden complexity in the fact that the seemingly-simple “workAttribute” can itself contain sub-documents or arrays of values.  This is where the tExtractJSONFields component is invaluable.

The tExtractJSON component turns a single JSON “column” into multiple columns, which is what we’re after for storing data in a relational database.  You find the tExtractJSON component under the “Processing-Fields” section of the Talend Palette.

In the example below, we use the tExtractJSONFields component to pull apart the attributes contained within in the “workAttribute” MongoDB sub-document.  Here’s an example of a workAttribute as shown from the MongoDB database:

 “workAttribute” : {
    "preserved" : false,
    "isNew" : false,
    "useSPK" : false,
    "jobUnitId" : 1,
    "verified" : false,
    "fromManualMatching" : false,
    "valid" : 0,
    "requiresVerification" : false }

We want to transform this document so that PostgreSQL can store it in a series of rows and columns.  Here’s what the tExtractJSONFields component configuration looks like to do this:

image

The key options are the “JSON field”, and the “Edit schema”.  It is via the latter option than we tell the component how to transform the structure of the incoming JSON field.  In our case, we want to break the tMongoDBInput field “workAttribute” into multiple output fields:

image

Once you click OK, we can verify the fields that will be sent to our relational database, in the “Mapping” section of the tExtractJSONFields component; this will be filled-in automatically, based on our configuration of the schema.

image

The resulting structure is now suitable for inserting into our PostgreSQL database.  Drop a tPostgresqlOutput component from the Palette onto the Designer, and join it to the tExtractJSONFields component via “Row-Main”, as below.

image

Configure the tPostgreSQL component to point to your database, then run the transformation:

image

We can now inspect the PostgreSQL database to see the result of our transformation.

image

Using Talend Open Studio for Big Data, we’ve been able to translate a MongoDB document structure into the time-honored relational database notion of columns and rows.

Glenn Street

Data Architect