Data Analysis & Warehousing
Introduction to Data Warehousing
Section titled “Introduction to Data Warehousing”A Data Warehouse (DW) is a centralized repository that stores integrated data from multiple heterogeneous sources such as databases, transactional systems, and external files. It is designed specifically for querying and analysis, not for transaction processing.
Purpose
Section titled “Purpose”- To support decision-making and business intelligence (BI).
- To provide a historical, subject-oriented, integrated, and time-variant view of data.
Key Characteristics
Section titled “Key Characteristics”| Property | Description |
|---|---|
| Subject-oriented | Organized around major business subjects like sales, finance, or customers. |
| Integrated | Data is collected from various sources and standardized. |
| Time-variant | Maintains historical data (e.g., yearly, quarterly trends). |
| Non-volatile | Once data is stored, it is not updated or deleted frequently. |
Architecture of a Data Warehouse
Section titled “Architecture of a Data Warehouse”-
Data Sources: Operational databases, flat files, web logs, etc.
-
ETL Process (Extract, Transform, Load):
- Extract: Retrieve data from various sources.
- Transform: Clean, filter, and format data.
- Load: Store transformed data into the warehouse.
-
Data Storage Area: The central repository storing both current and historical data.
-
OLAP Engine (Online Analytical Processing): Supports complex queries, multidimensional analysis, and data summarization.
-
Front-end Tools: Dashboards, reports, and data visualization tools for business analysis.
Schemas Used
Section titled “Schemas Used”| Schema | Structure | Description |
|---|---|---|
| Star Schema | Central fact table linked to dimension tables. | Simplest and widely used. |
| Snowflake Schema | Dimension tables are normalized into multiple related tables. | Reduces redundancy. |
| Galaxy Schema | Multiple fact tables share dimension tables. | Used for complex data marts. |
Benefits
Section titled “Benefits”- Enhances data analysis and reporting.
- Improves decision-making and forecasting.
- Provides data consistency across the organization.
- Supports historical and trend analysis.
Introduction to Data Mining
Section titled “Introduction to Data Mining”Data Mining is the process of extracting hidden patterns, trends, and knowledge from large datasets using statistical, mathematical, and machine learning techniques.
It is often considered as “Knowledge Discovery in Databases (KDD)”.
Steps in Data Mining (KDD Process)
Section titled “Steps in Data Mining (KDD Process)”- Data Cleaning: Remove noise and inconsistencies.
- Data Integration: Combine data from multiple sources.
- Data Selection: Retrieve relevant data for analysis.
- Data Transformation: Convert data into appropriate forms for mining.
- Data Mining: Apply algorithms to extract patterns (e.g., clustering, classification).
- Pattern Evaluation: Identify interesting and useful patterns.
- Knowledge Presentation: Visualize the results for decision-making.
Techniques of Data Mining
Section titled “Techniques of Data Mining”| Technique | Description | Example |
|---|---|---|
| Classification | Assigns data into predefined categories. | Spam vs. Non-spam email. |
| Clustering | Groups similar data items together. | Customer segmentation. |
| Association Rule Mining | Discovers relationships between items. | “If customer buys bread, they also buy butter.” |
| Regression | Predicts continuous values. | Predicting sales revenue. |
| Anomaly Detection | Identifies outliers or unusual data points. | Fraud detection. |
Applications
Section titled “Applications”- Business Intelligence: Market basket analysis, sales forecasting.
- Finance: Credit scoring, risk analysis.
- Healthcare: Disease prediction, patient profiling.
- Web Mining: User behavior analysis, recommendation systems.
Difference between Data Warehousing and Data Mining
Section titled “Difference between Data Warehousing and Data Mining”| Aspect | Data Warehousing | Data Mining |
|---|---|---|
| Purpose | Store and organize data | Analyze and discover patterns |
| Process Type | Data storage | Data analysis |
| Tools Used | OLAP, ETL | ML algorithms, statistics |
| Outcome | Historical data repository | Actionable insights and predictions |