
Data Science for Business Decision-Making: Turning Numbers into Strategic Insight - Chapter 2


Published 2026-03-08 03:55

# Chapter 2: Data Fundamentals and Quality Assurance

> **Data is the lifeblood of decision-making. Its quality determines whether insights are actionable or misleading. This chapter lays the groundwork: understand what data looks like, how it arrives, and the processes that safeguard its integrity.**

## 1. What Is Data?

| Category | Description | Example |
|---|---|---|
| **Structured** | Fixed schema, tabular format. | Customer table in a relational database. |
| **Semi-structured** | Some hierarchy or tagging but not rigid schema. | JSON logs, XML feeds. |
| **Unstructured** | No inherent structure. | Text documents, images, audio. |

> **Key Takeaway:** Every dataset fits one of these three buckets. Knowing the category informs the choice of storage, processing, and validation techniques.

## 2. Data Types and Structures

| Primitive Type | Typical Use | Storage Formats |
|---|---|---|
| **Numeric** | Quantitative metrics (sales, scores). | CSV, Parquet, SQL numeric columns |
| **Categorical** | Class labels, enums. | One-hot encoded arrays, dictionary-encoded columns |
| **Text** | Descriptive fields, comments. | UTF-8 strings, NoSQL text stores |
| **Datetime** | Timestamps, dates. | ISO-8601 strings, TIMESTAMP columns |
| **Boolean** | Flags, true/false. | BIT or BOOL columns |

### 2.1 Common Data Structures

- **Table (Rows × Columns)** – Most common; relational databases, Pandas DataFrames.
- **Nested Objects** – Arrays of objects, typical in JSON; requires flattening for analysis.
- **Graph** – Nodes and edges; used in social network analysis.
- **Time Series** – Ordered by time; needs resampling, lag features.

## 3. Data Sources in Business Contexts

| Source | Typical Data | Challenges |
|---|---|---|
| **Transactional Systems** | Sales, inventory logs. | High volume, real-time ingestion. |
| **CRM / ERP** | Customer interactions, supply chain. | Data silos, inconsistent schemas. |
| **External APIs** | Market feeds, social media. | Rate limits, authentication. |
| **IoT Devices** | Sensor streams, telemetry. | Noise, missing values. |
| **Surveys / Forms** | Feedback, demographic. | Human error, missing responses. |
| **Public Datasets** | Economic indicators, weather. | Licensing, currency conversions. |

> **Pro Tip:** Maintain a *Data Source Registry* – a living document listing each source, schema, frequency, and owner. It becomes invaluable for onboarding and troubleshooting.

## 4. The Six Dimensions of Data Quality

1. **Accuracy** – Correctness relative to reality.
2. **Completeness** – All required data present.
3. **Consistency** – Harmonized across systems.
4. **Timeliness** – Freshness of the data.
5. **Uniqueness** – No duplicate records.
6. **Validity** – Conformance to business rules.

### 4.1 Common Quality Issues

| Issue | Impact | Example |
|---|---|---|
| Duplicate Rows | Skewed statistics | Same customer ID appears twice with slightly different names |
| Null Values | Bias, missing insights | Missing age field for 20% of survey respondents |
| Inconsistent Units | Misleading aggregates | Sales in USD and EUR mixed in same column |
| Out-of-Range Values | Invalid predictions | Temperature recorded as 200°C |

## 5. Data Cleaning & Validation Pipeline

```python
import pandas as pd

# Load sample data
df = pd.read_csv('customer_data.csv')

# 1. Remove duplicates, keeping the last occurrence of each customer_id
df = df.drop_duplicates(subset='customer_id', keep='last')

# 2. Standardize dates (unparseable values become NaT)
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

# 3. Handle missing values: impute age with the median
#    (assign the result; chained inplace fillna is unreliable in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())

# 4. Validate ranges
assert df['age'].between(0, 120).all(), 'Age out of bounds'

# 5. Encode categoricals as integer codes (missing values become -1)
df['country'] = df['country'].astype('category').cat.codes

# 6. Store cleaned data (requires a Parquet engine such as pyarrow)
df.to_parquet('clean_customer_data.parquet')
```

> **Note:** Automate these steps via a *data-quality service* that runs on ingestion, logs errors, and triggers alerts for manual review.

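
One way such an ingestion-time check might look is sketched below. It is an illustration, not a prescribed design: the function names `validate_record` and `run_quality_checks`, the logger name, and the age bounds are assumptions, and records are treated as plain dicts as they might arrive before being loaded into a DataFrame. Unlike the pipeline above, which stops at the first failed `assert`, this sketch collects every problem and logs each rejected record so it can be reviewed manually. It also keeps the first occurrence of a duplicate `customer_id` rather than the last.

```python
import logging
from datetime import datetime

logger = logging.getLogger("data_quality")  # hypothetical logger name

# Assumed bounds for illustration; real limits come from business rules.
AGE_RANGE = (0, 120)


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")

    age = record.get("age")
    if age is None:
        problems.append("missing age")
    elif not isinstance(age, (int, float)) or not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append(f"age out of range: {age}")

    try:
        datetime.fromisoformat(record.get("signup_date") or "")
    except (TypeError, ValueError):
        problems.append(f"invalid signup_date: {record.get('signup_date')!r}")
    return problems


def run_quality_checks(records):
    """Split records into accepted and rejected, logging each rejection."""
    accepted, rejected = [], []
    seen_ids = set()
    for record in records:
        problems = validate_record(record)
        customer_id = record.get("customer_id")
        if customer_id in seen_ids:
            problems.append(f"duplicate customer_id: {customer_id}")
        if problems:
            logger.warning("rejected %r: %s", customer_id, "; ".join(problems))
            rejected.append((record, problems))
        else:
            seen_ids.add(customer_id)
            accepted.append(record)
    return accepted, rejected


batch = [
    {"customer_id": "C1", "age": 34, "signup_date": "2024-01-15"},
    {"customer_id": "C1", "age": 35, "signup_date": "2024-01-16"},
    {"customer_id": "C2", "age": 200, "signup_date": "2024-02-01"},
    {"customer_id": "C3", "age": 41, "signup_date": "not a date"},
]
accepted, rejected = run_quality_checks(batch)
print(f"{len(accepted)} accepted, {len(rejected)} rejected")
```

Returning the rejected records together with their reasons, instead of raising, lets the caller decide whether to alert, quarantine, or retry.
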
## 6. Data Governance Framework

| Layer | Responsibility | Key Controls |
|---|---|---|
| **Policy** | Executive board | Data privacy, retention policies |
| **Standards** | Data architects | Naming conventions, schema versioning |
| **Operations** | Data engineers | Pipeline orchestration, audit trails |
| **Security** | IAM & compliance | Role-based access, encryption |
| **Quality** | QA analysts | Validation scripts, dashboards |

### 6.1 Data Stewardship

- **Stewards** are business owners accountable for a data domain.
- **Roles**: Curator (maintains quality), Custodian (manages access), Analyst (extracts insights).

## 7. Data Lineage & Traceability

Traceability ensures every record's journey from source to final output is visible.

```mermaid
graph LR
    A[Raw Data Store] --> B[ETL Pipeline]
    B --> C[Data Lake]
    C --> D[Data Warehouse]
    D --> E[Analytics]
```

- **Metadata Catalog** – Store schema, transformations, and lineage.
- **Versioning** – Tag datasets with commit hashes or timestamps.

## 8. Practical Checklist for Data Reliability

| Task | Tool | Frequency |
|---|---|---|
| Verify schema consistency | JSON Schema, Avro | On ingestion |
| Detect data drift | Datadog, Evidently | Daily |
| Monitor missingness | Great Expectations | Weekly |
| Validate business rules | Custom assertion scripts | On batch |
| Audit access logs | Snowflake Access History | Monthly |

## 9. Conclusion

Data quality is not a one-off chore; it is an ongoing discipline that blends people, processes, and technology. By mastering the fundamentals (data types, sources, and structures) and embedding robust validation and governance, analysts transform raw numbers into reliable insights that drive strategic decisions.

> **Action Point:** Conduct a *Data Quality Health Check* for your current projects. Document issues, prioritize fixes, and schedule recurring reviews.
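
As a possible starting point for such a health check, the sketch below scores a small batch on three of the quality dimensions from this chapter: completeness, uniqueness, and validity. Everything in it is an assumption made for illustration: the `health_check` function, its parameters, the report keys, the sample records, and the 0 to 120 age rule. The way each score is defined (a simple ratio) is one reasonable choice among several.

```python
def health_check(records, required_fields, key_field, validators):
    """Score a batch of dict records on three quality dimensions.

    All names here are hypothetical; the ratios are one simple way to
    quantify each dimension, not a standard definition.
    """
    total = len(records)
    report = {"rows": total}
    if total == 0:
        return report

    def is_present(value):
        return value not in (None, "")

    # Completeness: share of rows where each required field is filled in.
    for field in required_fields:
        present = sum(1 for r in records if is_present(r.get(field)))
        report[f"completeness.{field}"] = present / total

    # Uniqueness: distinct key values divided by the number of rows.
    keys = [r.get(key_field) for r in records]
    report["uniqueness"] = len(set(keys)) / total

    # Validity: share of non-missing values that pass the business rule.
    for field, rule in validators.items():
        values = [r[field] for r in records if is_present(r.get(field))]
        passed = sum(1 for v in values if rule(v))
        report[f"validity.{field}"] = passed / len(values) if values else None

    return report


records = [
    {"customer_id": "C1", "age": 34, "country": "DE"},
    {"customer_id": "C2", "age": None, "country": "FR"},
    {"customer_id": "C2", "age": 150, "country": ""},
    {"customer_id": "C4", "age": 28, "country": "US"},
]
report = health_check(
    records,
    required_fields=["age", "country"],
    key_field="customer_id",
    validators={"age": lambda a: 0 <= a <= 120},
)
for metric, score in report.items():
    print(metric, score)
```

Recording these scores on a schedule gives the recurring reviews a baseline to compare against, so a drop in any one dimension becomes visible.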