Abstract: |
Correlation is a powerful relationship measure used in science, engineering, and business to estimate trends and make forecasts. When the data are complex, large, and high-dimensional, providing a method that improves correlation identification is challenging. Many visualization methods have been proposed to address this problem, but they remain limited in accuracy and speed. Therefore, we propose four visualization techniques for correlation identification in large, multi-dimensional data. Depending on the purpose of the visualization task, the best-fitting technique can be selected to optimize correlation identification performance.
Existing visualization methods, such as Scatterplots (SCPs) and Parallel Coordinates Plots (PCPs), are designed to be general, supporting many visualization tasks, including correlation identification. However, due in large part to their generality, they do not provide the most efficient interface, in terms of speed and accuracy, for many tasks. To improve low-level correlation identification tasks, we propose a new correlation-specific visualization method called Correlation Coordinate Plots (CCPs). CCPs transform data into a powerful coordinate system for estimating the direction and strength of correlation between attributes.
The correlation identification task is also especially challenging when the number of dimensions is high, leading to many potential relationships, and/or when multiway dependencies are of interest. Several visualization methods have been proposed to aid the exploration of such information through the direct visualization of summary statistics. However, these methods are typically limited to the study of all possible pairwise and 3-way relationships, and they are rather rigid with respect to interactive exploration of low-dimensional subspaces.
Therefore, we propose three different visualization designs to optimize the correlation identification task in large, multi-dimensional data. The first is the Snowflake Visualization, a focus+context layout for exploring all pairwise correlations. We also enhance the basic CCP interface by using principal component analysis to project multiple attributes.
The second proposed design is a new interactive design for representing and exploring data relationships in PCPs. The approach exploits the point/line duality property of PCPs and a local linearity assumption on the data to extract and represent relationship summarizations. It simultaneously shows relationships in the data and the consistency of those relationships. Our approach supports various visualization tasks on large data, including cluster analysis, mixed linear and nonlinear pattern identification, hidden pattern detection, and outlier detection.
Finally, we propose a novel technique for storing and accessing these multiway dependencies through visualization. Exploration is supported by a variety of operations placed on the complex, and interactive visualization enables flexible investigations through overview and detail views of the data.
We provide various use cases, comparisons with prior work, and a user study to demonstrate how our proposed approaches help to explore correlation in large, high-dimensional data efficiently. The results confirm that our approaches (CCP/Snowflake, DSPCP, and MultiDepViz) outperform some current visualization techniques, such as SCP, PCP, SCP matrix, Corrgram, Angular Histogram (AngHist), and UntangleMap, in both accuracy and timing.
|