You’re presented with a new set of data, or a database that you’ve never worked with before. How can you get an understanding of that data so that you can use it effectively? Simple: profile that data. One definition describes data profiling as the activity of “…examining the data available in an existing data source and collecting statistics and information about that data.” The point of profiling is to get an overview of a data set, enabling you to ask additional questions about that data. Data profiling is also a very important component of assessing the quality of a set of data, and is typically the first activity in data quality investigations. Some typical artifacts of the data profiling process are distributions of values for an attribute, whether presented in a table or a bar chart; measures of central tendency, like mean, median, and mode; and measures of how data is spread out, for example a boxplot. In this era of “big data” and “data lakes”, an understanding of data from disparate sources is crucial to help make meaning out of the mass. This post is the first in an envisioned series on the topic of Data Quality.
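As a minimal illustration of those artifacts, here is how a value distribution and the measures of central tendency might be computed in Python with the standard library (the column sample is hypothetical):

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical sample of values from a single numeric column (e.g., "age")
ages = [34, 29, 41, 29, 57, 34, 29, 62, 41, 29]

# Distribution of values: how often each value occurs
distribution = Counter(ages)
print(distribution.most_common(3))   # top values and their counts

# Measures of central tendency
print(mean(ages), median(ages), mode(ages))
```

In practice a profiling tool would render the distribution as a table or bar chart, but the underlying counts are exactly what `Counter` produces here.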
We’ve already touched on some of the reasons for profiling a set of data, but let’s expand on these reasons.
One area where data profiling is important, but which may not be obvious to a casual database user, is query optimization. All modern relational databases employ cost-based optimizers and collect their own profiles of the data in tables and columns. Such information informs measures of the selectivity of an index, for example, which may in turn cause the optimizer to choose a particular index for a query’s access plan, or not. Based on the database’s chosen query plan, a developer might revise the query in question, add an index to a table, or even refactor the table or tables involved. As database technology has matured, query optimizers have grown from simplistic rule-based operations to more automated, cost-based optimizers. What is the foundation for an effective cost-based optimizer? It’s statistics, or profiles, of data in tables, indexes, and columns. Here’s an example of the kind of profiling data that PostgreSQL collects (from the pg_stats view).
We can see that the PostgreSQL optimizer is interested in the fraction of a table that contains null values for each column, the number of distinct values for each column, and the most common values across the table for each column.
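For a small sample, these same statistics are simple to compute by hand. A sketch in Python, with hypothetical column values and `None` standing in for SQL NULL:

```python
from collections import Counter

# Hypothetical values from one column; None stands in for NULL
city = ["Boston", "Chicago", None, "Boston", "Denver", None, "Boston"]

# Fraction of the column that is NULL (pg_stats calls this null_frac)
null_frac = sum(v is None for v in city) / len(city)

# Number of distinct non-null values (pg_stats: n_distinct)
non_null = [v for v in city if v is not None]
n_distinct = len(set(non_null))

# Most common values and their frequencies (pg_stats: most_common_vals)
most_common = Counter(non_null).most_common(2)

print(null_frac, n_distinct, most_common)
```

The names in the comments mirror the corresponding pg_stats columns; the computation itself is just counting.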
Before incorporating any data into an existing data warehouse, it’s critical to understand that data. It may turn out, after you’ve profiled the data, that it doesn’t conform to business assumptions and is therefore not particularly useful to add to your existing warehouse. Ralph Kimball, one of the fathers of the modern data warehouse industry, has claimed that data profiling “…should be the mandatory ‘next step’ in every data warehouse project after the business requirements gathering.”
An entire book, the classic “Data Preparation for Data Mining” by Dorian Pyle, discusses in great detail the need to profile data before using it for data mining activities. Pyle calls the process “Data Characterization”, and he discusses ideas very similar to what we refer to today as Data Profiling. Pyle’s theme is that without properly understanding and profiling data, data mining activities are unlikely to succeed. For example, what if you are trying to analyze sales data in your company’s German division, but the data you have to work with is actually only for sales in Asia? Without taking the time to profile and understand the data, you very well might draw erroneous conclusions through your data mining activities. Pyle refers to this work as “Surveying the Data” and says that “A most particular purpose of the survey is to find out if the answer to the problem that is to be modeled is actually in the data prior to investing much time, money, and resource in building the model.” (pg. 26).
When dealing with sets of data that come from more than one source, even when that data is on the same subject, profiling the data provides insight into ways to reconcile the different data representations. It’s important to reconcile different representations of the same piece of data from different data sets before combining those data sets together. For example, in one data file states may be identified by a two-digit numeric code, but in another data file, by the two-letter code most of us are accustomed to.
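A reconciliation like this often comes down to a simple crosswalk table. Here is a sketch in Python; the numeric codes shown are U.S. FIPS state codes, and only a few states are included for illustration:

```python
# Illustrative crosswalk from two-digit numeric state codes (FIPS codes)
# to the two-letter postal abbreviations; only a few states shown
FIPS_TO_POSTAL = {"06": "CA", "36": "NY", "48": "TX"}

def normalize_state(value: str) -> str:
    """Map a numeric state code to its postal abbreviation; pass through otherwise."""
    return FIPS_TO_POSTAL.get(value, value)

print(normalize_state("36"))   # numeric code normalized to the postal form
print(normalize_state("NY"))   # already in the target representation
```

Once both files use the same representation, the data sets can be combined or matched safely.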
When working with any unfamiliar data set, it’s tempting to assume that the data is clean and free of errors. Data profiling, however, offers a way to test the hypothesis that the data is actually trustworthy. Data, even from the most reliable sources, is subject to a large number of errors, often introduced by well-meaning human beings, or by source applications that contained bugs. But profiling data allows you to model, for your purposes, what a good set of data would look like. Then you can create rules for remediating those cases where the data is less than your ideal. For example, by examining the most common occurrence of a date pattern (e.g., “MM-DD-YYYY”) in a set of data, a developer may propose a rule that enforces this format for any future data that is to enter the data set.
The same rule described above (“dates should have the format ‘MM-DD-YYYY’”) can be used to clean any existing data that doesn’t conform to it, as well as any new data that is a candidate for inclusion in your data set. The rules become reusable methods that allow you to fix data that’s already broken and to prevent bad data from entering your data store in the future.
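As a sketch of what such a rule might look like in code: one function validates the profiled MM-DD-YYYY pattern, and another remediates one known deviation (ISO-style YYYY-MM-DD dates). The function names and the single remediation case are illustrative, not part of any particular library:

```python
import re

# The pattern the profiling exercise identified as the standard
MMDDYYYY = re.compile(r"^(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-\d{4}$")
# One known deviation observed in the data: ISO-style dates
ISO = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")

def conforms(value: str) -> bool:
    """Does the value match the profiled MM-DD-YYYY pattern?"""
    return bool(MMDDYYYY.match(value))

def remediate(value: str) -> str:
    """Fix one known deviation: rewrite ISO YYYY-MM-DD dates as MM-DD-YYYY."""
    m = ISO.match(value)
    if m:
        year, month, day = m.groups()
        return f"{month}-{day}-{year}"
    return value

print(conforms("07-04-2023"))    # matches the standard pattern
print(remediate("2023-07-04"))   # rewritten into the standard pattern
```

The same pair of functions serves both purposes the paragraph describes: `remediate` cleans existing data, and `conforms` gates new data before it enters the store.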
Since you’re unfamiliar with the data in question, you’ll want to get a good understanding of just what kind of data you’re dealing with before moving onto data integration, data mining, or similar activities. For structured data, in relational databases or XML documents, the process is relatively straightforward. The kinds of information we want to identify include these items:
Are columns or attributes strings, numbers, or something else? Without understanding the types of data in your set, you won’t be able to know what kinds of meaningful operations you can perform on it. You can’t sum a set of strings, for example, in a meaningful, useful way. You might also identify anomalies between the name of an attribute and the data itself. A column called “person_name”, for example, shouldn’t store numeric data.
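A crude type-inference pass over a column’s values can answer this question. A minimal sketch (the three-way classification is an assumption for illustration; real profilers distinguish many more types, such as dates and booleans):

```python
def infer_type(values):
    """Crude type inference for a column: 'integer', 'number', or 'string'."""
    def kind(v):
        try:
            int(v)
            return "integer"
        except ValueError:
            try:
                float(v)
                return "number"
            except ValueError:
                return "string"
    kinds = {kind(v) for v in values}
    if kinds == {"integer"}:
        return "integer"
    if kinds <= {"integer", "number"}:
        return "number"
    return "string"

print(infer_type(["12", "7", "42"]))      # integer
print(infer_type(["12.5", "7", "42"]))    # number: mixed ints and floats
print(infer_type(["Alice", "Bob"]))       # string
# An "integer" result for a column named person_name would flag an anomaly
print(infer_type(["1001", "1002"]))
```

Comparing the inferred type against the attribute’s name is exactly how the person_name anomaly above would surface.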
What are typical patterns of values for attributes or columns? An example of a value pattern is the different ways of storing dates, as we discussed above. Dates can be stored in an ISO format (e.g., “YYYY-MM-DD”), or in any of a number of very different formats (for example, see the discussion of Java date formats here). But what if the date information is stored in varying formats? You probably will decide to transform it to a standard representation before attempting to use it for any kind of analysis. As we touched on above, identifying value patterns can enable creating data quality rules at a later time, to help prevent poor data from entering your data store.
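A common way to discover value patterns is to map every digit to a placeholder character and count the resulting shapes. A sketch, using “9” for digits and “A” for letters (a convention many profiling tools use, though the exact symbols vary):

```python
import re
from collections import Counter

def value_pattern(value: str) -> str:
    """Reduce a value to its shape: digits -> 9, letters -> A, punctuation kept."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

# Hypothetical date column with inconsistent formats
dates = ["01-15-2020", "2020-01-15", "02-28-2021", "03-01-2021"]
pattern_counts = Counter(value_pattern(d) for d in dates)
print(pattern_counts.most_common())
# The dominant pattern (here 99-99-9999) suggests the standard to normalize toward
```

The minority patterns in the output are the ones a later data quality rule would target for transformation or rejection.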
What are the minimum and maximum values of each attribute? Identifying these values can also lead to data quality rules. For example, if a value for a single record lies well outside the typical range of values for all records in the set (it’s an “outlier”), it may be an indicator of poor data quality, and you may want to create rules to reject such records. Value ranges are most useful for ordinal or numeric values with discrete possibilities. For strings or more complex data, for example JSON, they are likely to be less helpful. Creating a distribution of word frequencies, or a concordance, may better serve to highlight anomalies in textual data than a simple range.
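One common (though by no means only) way to turn a value range into an outlier rule is Tukey’s interquartile-range rule of thumb. A sketch on a hypothetical numeric column:

```python
from statistics import quantiles

# Hypothetical salary column (in thousands); 250 is well outside the mass
salaries = [52, 48, 55, 61, 50, 47, 58, 250]

print(min(salaries), max(salaries))   # raw range of the attribute

# Tukey's rule of thumb: flag values beyond 1.5 * IQR from the quartiles
q1, _, q3 = quantiles(salaries, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in salaries if v < low or v > high]
print(outliers)   # candidates for a rejection rule
```

Note how the raw maximum (250) alone already hints at the problem; the IQR rule simply formalizes “well outside the range of the mass of the data”.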
What percent of fields have missing values for a record? Sometimes you’ll just want to exclude records that consist of many attributes with missing values, because they aren’t complete enough to be useful for your needs.
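A per-record completeness check is a one-liner once records are dictionaries. A sketch (the 25% threshold is an arbitrary choice for illustration):

```python
def missing_fraction(record: dict) -> float:
    """Fraction of a record's fields that are missing (None or empty string)."""
    values = list(record.values())
    return sum(v in (None, "") for v in values) / len(values)

sparse = {"name": "Ada", "email": None, "phone": "", "city": "London"}
full = {"name": "Grace", "email": "g@example.com", "phone": "555-0100", "city": "NYC"}

print(missing_fraction(sparse))   # half the fields are missing

# Exclude records that are too incomplete to be useful (threshold is arbitrary)
keep = [r for r in (sparse, full) if missing_fraction(r) <= 0.25]
print(len(keep))
```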
Across the full set of data, are there particular attributes that have many missing values? Identifying these is often a preliminary step to deciding how to treat missing data: for example, whether to substitute the attribute’s mean value into records that are missing the field. If an attribute is missing in most records, it is probably not very useful for further analysis, and you may choose to exclude it. Missing values are probably something many people don’t think of as a data quality problem. But when you consider the importance of certain elements of a data set for analysis, you realize it would be great to have all relevant data. This is why a count of missing values for a field across a set of data can be helpful in identifying the overall quality of that data.
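The column-wise counterpart of the per-record check above is a tally of missing values per field. A sketch on a hypothetical record set:

```python
from collections import Counter

records = [
    {"name": "Ada",    "fax": None, "email": "a@example.com"},
    {"name": "Grace",  "fax": None, "email": None},
    {"name": "Edsger", "fax": None, "email": "e@example.com"},
]

# Count missing (None) values per field across the whole data set
missing_by_field = Counter()
for r in records:
    for field, value in r.items():
        if value is None:
            missing_by_field[field] += 1

print(missing_by_field)   # "fax" is missing everywhere: a candidate for exclusion
```

A field like "fax" here, missing in every record, is the kind of attribute you would likely drop before analysis; a field with only occasional gaps might instead get an imputation rule.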
This is similar to identifying the primary key of a relational database table: it is the process of identifying those attributes or columns whose values are unique across the full set of data. Identifying the things that make records unique is important for the later activity of deduplicating data.
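Testing a column for key-worthiness is a matter of comparing value counts. A minimal sketch (the helper name and sample records are illustrative):

```python
def is_candidate_key(records, column):
    """A column can serve as a key only if all of its values are unique."""
    values = [r[column] for r in records]
    return len(values) == len(set(values))

records = [
    {"id": 1, "email": "a@example.com", "state": "NY"},
    {"id": 2, "email": "b@example.com", "state": "NY"},
    {"id": 3, "email": "a@example.com", "state": "CA"},
]

print(is_candidate_key(records, "id"))      # unique throughout
print(is_candidate_key(records, "email"))   # duplicate value: not a key
```

A duplicate in a column you expected to be unique, like the repeated email above, is often the first signal that deduplication work lies ahead.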
Sometimes you’ll have multiple sets of data from the same data source. You may have, for example, a file containing names and addresses and another file containing states and zip codes. The address file probably contains zip codes, which you can use to link to the file of states and zip codes, perhaps to use this second file as a way to validate the data in the first file.
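Using the second file for validation might look like the sketch below. The zip-to-state lookup shown is a tiny illustrative excerpt, not a complete reference table:

```python
# Illustrative excerpt of the second file: zip code -> state
ZIP_TO_STATE = {"10001": "NY", "60601": "IL", "94105": "CA"}

# Records from the first file (names and addresses)
addresses = [
    {"name": "Ada",   "state": "NY", "zip": "10001"},
    {"name": "Grace", "state": "CA", "zip": "60601"},  # state disagrees with zip
]

# Flag records whose state doesn't match what the zip-code file says
mismatches = [a["name"] for a in addresses
              if ZIP_TO_STATE.get(a["zip"]) not in (None, a["state"])]
print(mismatches)
```

Records flagged this way would then go through whatever remediation rules the earlier profiling work established.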
Data profiling is clearly just a first step on the path to improving your data’s quality. In further entries in this series, we’ll look at other topics related to Data Quality, including Data Quality algorithms, dimensions of Data Quality, Functional Dependencies, and algorithms for deduplicating data.