At CCC we use Polyglot Persistence in many of our products. This inevitably leads to the need for data integration. A tool we’ve found useful is Talend Open Studio for Big Data. This “tips-and-tricks” post discusses using Talend to migrate data from a MongoDB database to a PostgreSQL relational database for reporting and similar use cases.
Talend Open Studio for Big Data comes prepackaged with connectors and components for a large number of open source and proprietary “big data” technologies, including Cassandra, HBase, Neo4J, and MongoDB, which is our source NoSQL database for this article.
Note that the version of Talend Open Studio for Big Data used in this post is 5.5.1.
For this job, we’ll need only four Talend components:
tMongoDBConnection to connect to our source MongoDB database
tMongoDBInput to extract a particular set of documents
tExtractJSONFields to transform the MongoDB document into columns
tPostgresqlOutput to store the resulting data in a PostgreSQL target database
The key component here is tExtractJSONFields, which transforms the JSON document returned by a MongoDB query into individual fields. This component borrows the concept of XPath from the XML world, using XPath-style expressions to pull individual pieces out of a JSON document.
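To make the XPath-over-JSON idea concrete, here is a minimal Python sketch of what such an extraction does conceptually. (Talend itself generates Java under the hood; the document shape and paths below are hypothetical, invented for illustration.)

```python
import json

def extract_field(document, xpath):
    """Walk a simple XPath-style path (e.g. "/workAttribute/name")
    through a parsed JSON document and return the value it points at."""
    value = document
    for part in xpath.strip("/").split("/"):
        value = value[part]
    return value

doc = json.loads('{"_id": "123", "workAttribute": {"name": "jobTitle", "value": "Engineer"}}')
print(extract_field(doc, "/workAttribute/name"))  # -> jobTitle
```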
Begin by dragging the tMongoDBConnection component onto the Designer. Configure the connection information for the MongoDB database. This will look something like the following:
Now drag a tMongoDBInput component to the Designer. Join it to the tMongoDBConnection via an “OnSubjobOK” trigger.
Now configure the tMongoDBInput component.
It’s important to note a couple of things about the configuration of the tMongoDBInput component.
There is hidden complexity here, however: the seemingly simple "workAttribute" field can itself contain sub-documents or arrays of values. This is where the tExtractJSONFields component is invaluable.
The tExtractJSONFields component turns a single JSON "column" into multiple columns, which is exactly what we're after for storing data in a relational database. You'll find it under the "Processing > Fields" section of the Talend Palette.
In the example below, we use the tExtractJSONFields component to pull apart the attributes contained within the "workAttribute" MongoDB sub-document. Here's an example of a workAttribute as shown from the MongoDB database:
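As an illustrative stand-in (the field names below are hypothetical, not taken from an actual database), a document carrying a workAttribute sub-document might look like the following, expressed here as a Python dict mirroring the JSON:

```python
# Hypothetical illustration only -- these field names are invented for
# this sketch and are not from the original database.
mongo_document = {
    "_id": "emp-001",
    "workAttribute": {
        "name": "jobTitle",
        "value": "Software Engineer",
        "effectiveDate": "2014-06-01",
    },
}
```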
We want to transform this document so that PostgreSQL can store it in a series of rows and columns. Here’s what the tExtractJSONFields component configuration looks like to do this:
The key options are "JSON field" and "Edit schema". It is via the latter that we tell the component how to transform the structure of the incoming JSON field. In our case, we want to break the tMongoDBInput field "workAttribute" into multiple output fields:
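The schema-to-columns idea can be sketched in plain Python. (The column names and field paths below are hypothetical assumptions for illustration; the real mapping lives in Talend's "Edit schema" and "Mapping" configuration, not in hand-written code.)

```python
# Hypothetical mapping of output columns to fields within the
# "workAttribute" sub-document.
schema = {
    "attr_name": "name",
    "attr_value": "value",
}

def to_row(document, schema):
    """Flatten one document's workAttribute into a column -> value dict."""
    sub = document["workAttribute"]
    return {column: sub.get(field) for column, field in schema.items()}

doc = {"workAttribute": {"name": "jobTitle", "value": "Engineer"}}
print(to_row(doc, schema))  # {'attr_name': 'jobTitle', 'attr_value': 'Engineer'}
```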
Once you click OK, you can verify the fields that will be sent to the relational database in the "Mapping" section of the tExtractJSONFields component; this is filled in automatically based on the schema configuration.
The resulting structure is now suitable for inserting into our PostgreSQL database. Drop a tPostgresqlOutput component from the Palette onto the Designer, and join it to the tExtractJSONFields component via a "Row > Main" connection, as below.
Configure the tPostgresqlOutput component to point to your database, then run the job:
We can now inspect the PostgreSQL database to see the result of our transformation.
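What that inspection amounts to can be sketched in a few lines of Python. (Python's built-in sqlite3 module stands in for PostgreSQL here so the sketch runs anywhere without a database server; the table and column names are hypothetical.)

```python
import sqlite3

# sqlite3 as a stand-in for the PostgreSQL target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_attributes (attr_name TEXT, attr_value TEXT)")
conn.executemany(
    "INSERT INTO work_attributes VALUES (?, ?)",
    [("jobTitle", "Software Engineer"), ("department", "R&D")],
)

# Inspect the flattened rows, as we would in psql with a SELECT:
for row in conn.execute("SELECT attr_name, attr_value FROM work_attributes"):
    print(row)
```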
Using Talend Open Studio for Big Data, we’ve been able to translate a MongoDB document structure into the time-honored relational database notion of columns and rows.