Data profiling is a technique for detecting and investigating data quality issues such as duplication, inconsistency, inaccuracy, and incompleteness. It is done by analyzing one or more data sources and collecting metadata that describes the state of the data, so that data managers can investigate the cause of data errors. Data profiling also lets you view statistics about the data, such as the degree of redundancy and the distribution of attribute values, in tabular and graphical form. Typical data profiling activities include:
- Collecting descriptive statistics such as min, max, count, and sum
- Collecting data types, lengths, and repetition patterns
- Tagging data with keywords, descriptions, or categories
- Assessing data quality and the risk of performing joins on the data
- Discovering metadata and evaluating its accuracy
- Identifying distributions, key candidates, foreign-key candidates, functional dependencies, and implicit value dependencies, and performing cross-table analysis
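As a minimal sketch of the first few activities above, the snippet below profiles a single column with plain Python: it counts nulls and distinct values, computes min/max/sum over the numeric entries, and summarizes repetition patterns by mapping digits to `9` and letters to `A` (the sample values and function name are illustrative, not from any particular tool):

```python
import re
from collections import Counter

def profile_column(values):
    """Collect basic profiling statistics for a list of raw values."""
    non_null = [v for v in values if v not in (None, "")]
    numeric = [float(v) for v in non_null
               if re.fullmatch(r"-?\d+(\.\d+)?", str(v))]
    # Repetition pattern: digits become 9, letters become A,
    # so "010-1234-5678" and "010-9876-5432" share one pattern.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in non_null
    )
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "sum": sum(numeric) if numeric else None,
        "top_patterns": patterns.most_common(3),
    }

stats = profile_column(["010-1234-5678", "010-9876-5432", None, "1234"])
```

Here `stats["top_patterns"]` shows that two of the values share the phone-number shape `999-9999-9999`, which is exactly the kind of metadata a profiler surfaces.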
There are three main types of data profiling:
- Structure discovery
Checks that the data is consistent and well-formed, and performs mathematical checks on the data (e.g. sum, min, or max). Structure discovery helps you understand how well your data is structured; for example, you can find out what percentage of phone numbers have an incorrect number of digits.
- Content discovery
Examines individual data records to detect errors. Content discovery identifies whether a specific row in a table has a problem, as well as systemic problems arising in the data (e.g. phone numbers with no area code).
- Relationship discovery
Discovers how parts of your data are interrelated: for example, key relationships between database tables, or references between cells in a spreadsheet. Understanding these relationships is essential for reusing data, since related data sources must be brought together in a way that either integrates them or preserves the important relationships.
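Relationship discovery in particular can be sketched with two simple checks: a column whose values are unique and non-null is a key candidate, and a column whose values are all contained in another table's column implies a foreign-key relationship (an inclusion dependency). The table and column names below are made up for illustration:

```python
def key_candidates(rows):
    """Columns whose values are non-null and unique across all rows
    are candidate keys."""
    columns = rows[0].keys()
    return [c for c in columns
            if all(r[c] is not None for r in rows)
            and len({r[c] for r in rows}) == len(rows)]

def is_foreign_key(child_rows, child_col, parent_rows, parent_col):
    """child_col plausibly references parent_col if every child value
    appears among the parent values (inclusion dependency)."""
    parent_values = {r[parent_col] for r in parent_rows}
    return all(r[child_col] in parent_values for r in child_rows)

# Illustrative tables: orders.customer_id should reference customers.id.
customers = [{"id": 1, "name": "Kim"}, {"id": 2, "name": "Lee"}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 2},
          {"order_id": 12, "customer_id": 1}]
```

Running `key_candidates(orders)` flags only `order_id` (since `customer_id` repeats), and `is_foreign_key(orders, "customer_id", customers, "id")` confirms the cross-table relationship. Real profilers apply the same idea at scale, with sampling and tolerance for dirty data.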
CLICK AI examines individual elements of the database in detail to check the quality of the data. This lets you find, and then remove or fix, fields containing null, invalid, or ambiguous values.
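This kind of element-level check can be sketched generically (this is not CLICK AI's actual implementation; the column name, phone format, and reason labels are assumptions for illustration):

```python
import re

# Assumed "valid" phone format for this example: 2-3 digits, 3-4 digits, 4 digits.
PHONE = re.compile(r"\d{2,3}-\d{3,4}-\d{4}")

def find_issues(rows, column):
    """Return (row_index, reason) pairs for null or invalid values
    in the given column."""
    issues = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in (None, ""):
            issues.append((i, "null"))
        elif not PHONE.fullmatch(value):
            issues.append((i, "invalid format"))
    return issues

rows = [{"phone": "010-1234-5678"}, {"phone": None}, {"phone": "1234"}]
issues = find_issues(rows, "phone")
```

Each flagged row can then be corrected or dropped before the data is used downstream.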