April 27, 2016

MongoDB to PostgreSQL JSONB via Talend

In a post last year, I described using Talend Big Data to move documents from a MongoDB database to a PostgreSQL database. The aim of that exercise was to pick apart the JSON document as stored in MongoDB and store certain keys into individual PostgreSQL columns. But what if you want to store the entire JSON document from MongoDB in PostgreSQL for possible future processing? Since release 9.2, first available in September 2012, PostgreSQL has offered a JSON data type. In the subsequent 9.4 release (December 2014), PostgreSQL added the more advanced JSONB type, “a more capable and efficient data type for storing JSON data” than the original JSON data type.

Both JSON data types have strong advantages over storing a MongoDB document as a simple text field. First, PostgreSQL will ensure that the JSON you’re trying to store is valid. In addition, PostgreSQL offers a number of operators for working with JSON documents, including ones that traverse a document to find individual keys and their values. Unlike the JSON data type, JSONB is stored in a binary format that eliminates reparsing the document on each retrieval, which can mean faster reads.
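As a quick illustration, here is a minimal sketch of that traversal from Java via JDBC. (The docs table, its data JSONB column, and the title key are hypothetical stand-ins for your own schema.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JsonbOperators {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/postgres", "postgres", "secret");
             Statement st = conn.createStatement()) {

            // PostgreSQL validates JSON on input; malformed text such as
            // '{"broken"' is rejected with "invalid input syntax for type json".

            // The ->> operator traverses a document and returns a key's value as text.
            try (ResultSet rs = st.executeQuery("SELECT data->>'title' FROM docs")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}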

The PostgreSQL manual recommends:

In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys.

So, let’s say that you are convinced that you want to store data that you formerly kept in MongoDB in a spiffy new JSONB column in your PostgreSQL database. Talend can do that, can’t it?

The answer is a resounding yes, but it’s a bit trickier than you might expect because of the way PostgreSQL interprets the JSON document on input.

The basic idea of the Talend job to move data from MongoDB to PostgreSQL is very simple. First, set up a tMongoDBInput component to read the document from your MongoDB database. Next, add a tPostgreSQLOutput component to your job. Finally, link the two components with a row (Main) connector. Here’s an example of what this would look like in Talend Open Studio for Big Data.

[Image: Talend job linking a tMongoDBInput component to a tPostgreSQLOutput component in Talend Open Studio for Big Data]

Unfortunately, this job will fail. The PostgreSQL JDBC driver doesn’t support JSONB directly (the JDBC standard doesn’t define a JSON data type yet), so by default it sends the JSON document to the server as a VARCHAR, and PostgreSQL won’t implicitly convert a VARCHAR value into a JSONB column on insert. What you’d like to do is simply pass the JSON as a Java String and have PostgreSQL recognize it on input, automatically casting the String to the JSONB data type.
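You can reproduce the failure outside of Talend with a few lines of plain JDBC. This sketch assumes the same hypothetical docs table with a single data JSONB column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonbInsertFails {
    public static void main(String[] args) throws Exception {
        // Plain URL with no extra connection parameters.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/postgres", "postgres", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO docs (data) VALUES (?)")) {
            ps.setString(1, "{\"title\": \"MongoDB to PostgreSQL\"}");
            // Fails: column "data" is of type jsonb but expression
            // is of type character varying
            ps.executeUpdate();
        }
    }
}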

There is a way around this problem, though, using the JDBC connection parameter stringtype. This is, in fact, how you can work with the PostgreSQL JSONB data type from your own Java programs. By setting stringtype=unspecified on the JDBC URL, the driver sends Strings without a declared type (“parameters will be sent to the server as untyped values, and the server will attempt to infer an appropriate type”), and PostgreSQL silently casts each String to JSONB. So, all we have to do is add the parameter to the URL in the Talend tPostgreSQLOutput component, correct? Unfortunately, the tPostgreSQLOutput component doesn’t allow you to modify the JDBC URL that it generates. But because Talend offers so many different components, including some more general ones, we can use a tJDBCOutput component in place of the PostgreSQL-specific tPostgreSQLOutput component.
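Here is the same sketch with stringtype=unspecified added to the URL. The insert now succeeds because the parameter arrives at the server untyped and is cast to JSONB there:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonbInsertWorks {
    public static void main(String[] args) throws Exception {
        // stringtype=unspecified: setString() values are sent untyped,
        // so the server infers jsonb from the target column.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/postgres?stringtype=unspecified",
                "postgres", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO docs (data) VALUES (?)")) {
            ps.setString(1, "{\"title\": \"MongoDB to PostgreSQL\"}");
            ps.executeUpdate(); // succeeds; the document lands in the JSONB column
        }
    }
}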

With tJDBCOutput in place, the job now looks like this:

[Image: Revised Talend job linking tMongoDBInput to a tJDBCOutput component]

The tJDBCOutput component allows you to connect to any database that supports JDBC and for which you have the appropriate driver jar file. Of greater interest for this article is that the tJDBCOutput component allows you to specify your own JDBC URL, including any specific connection parameters that the driver supports. In our case, we want a URL like this: “jdbc:postgresql://localhost/postgres?stringtype=unspecified”. Here’s an example of the full configuration of the tJDBCOutput component:

[Image: tJDBCOutput component configuration, with the JDBC URL including stringtype=unspecified]

The results of running this revised job are exactly what we were looking for. Here’s a snippet of the resulting PostgreSQL table with our JSON data neatly stored in a JSONB column:

[Image: Snippet of the resulting PostgreSQL table showing JSON documents stored in a JSONB column]
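Beyond eyeballing the table, a quick check from Java (again against the hypothetical docs table) confirms that the column really is JSONB and that individual keys can be pulled back out:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JsonbVerify {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/postgres", "postgres", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT pg_typeof(data), data->>'title' FROM docs LIMIT 1")) {
            if (rs.next()) {
                System.out.println(rs.getString(1)); // prints "jsonb"
                System.out.println(rs.getString(2)); // the extracted key's value
            }
        }
    }
}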

Glenn Street

Data Architect