BigData_DW_Real Document Overview

The document BigData_DW_Real.docx is a guide to big data processing architectures, covering both offline and real-time processing. It also describes the requirements and architectural design of a big data warehouse project.


Big Data Processing Architectures

Big data processing architectures are primarily classified into two types:

  1. Offline Processing Architecture
     • Use: After-the-fact data analysis and data mining applications.
     • Technologies: Hive, MapReduce, Spark SQL, etc.
     • Advantages: Capable of handling large volumes of data.
     • Disadvantages: Slower processing; poorly suited to real-time demands.

  2. Real-Time Processing Architecture
     • Use: Real-time monitoring and interactive applications.
     • Technologies: Spark Streaming, Flink.
     • Advantages: Fast processing and high responsiveness for time-sensitive data.
     • Disadvantages: Limited to simpler business logic.
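
The difference between the two architectures can be illustrated with a minimal plain-Java sketch. It does not use Hive, Spark, or Flink; the class and method names (ProcessingStyles, batchCount, StreamingCounter) are hypothetical and chosen only for illustration. The batch method needs the complete dataset before it returns anything, while the streaming counter keeps running state and has an up-to-date answer after every event.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Contrasts batch-style and incremental (streaming-style) counting. Illustrative only. */
class ProcessingStyles {

    /** Offline style: the full dataset is available, so scan it once and return the totals. */
    static Map<String, Integer> batchCount(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String event : events) {
            counts.merge(event, 1, Integer::sum);
        }
        return counts;
    }

    /** Real-time style: keep running state and update it as each event arrives. */
    static class StreamingCounter {
        private final Map<String, Integer> counts = new HashMap<>();

        /** Records one event and returns the updated count for that event type. */
        int onEvent(String event) {
            return counts.merge(event, 1, Integer::sum);
        }

        /** Returns a copy of the current totals. */
        Map<String, Integer> snapshot() {
            return new HashMap<>(counts);
        }
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("login", "order", "login");

        // Batch: one pass over the complete dataset, result only at the end.
        System.out.println(batchCount(events));

        // Streaming: a result is available after every single event.
        StreamingCounter counter = new StreamingCounter();
        for (String event : events) {
            System.out.println(event + " -> " + counter.onEvent(event));
        }
    }
}
```

Both approaches reach the same totals once all events have been seen. The trade-off is when the answer becomes available and how much state must be maintained per event, which is one reason real-time pipelines tend to favor simpler logic.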

Big Data Warehouse Project Requirements

The big data warehouse project encompasses six key requirements:

  1. Daily Active Users: Analysis with hourly trends and daily comparisons.
  2. Daily New Users: Analysis with hourly trends and daily comparisons.
  3. Daily Transaction Volume: Analysis with hourly trends and daily comparisons.
  4. Daily Order Count: Analysis with hourly trends and daily comparisons.
  5. Shopping Coupon Risk Warning: Function for identifying potential risks.
  6. Flexible User Purchase Analysis: Customizable analysis functionality.
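
As a rough illustration of the first requirement, the sketch below computes an hourly trend and a day-over-day comparison for active users in plain Java. The source does not define the event schema or what counts as "active", so several things here are assumptions: the Event class, its userId and hour fields, the rule that a user is active if they have at least one event, and all method names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hourly and daily active-user counts. Schema and definitions are assumptions. */
class HourlyActiveUsers {

    /** Hypothetical event shape: a user id and the hour of day (0-23) of the activity. */
    static class Event {
        final String userId;
        final int hour;

        Event(String userId, int hour) {
            this.userId = userId;
            this.hour = hour;
        }
    }

    /** Counts distinct users per hour; hours with no activity are absent from the result. */
    static Map<Integer, Integer> hourlyActiveUsers(List<Event> events) {
        Map<Integer, Set<String>> usersByHour = new TreeMap<>();
        for (Event e : events) {
            usersByHour.computeIfAbsent(e.hour, h -> new HashSet<>()).add(e.userId);
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, Set<String>> entry : usersByHour.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    /** Counts distinct users across the whole day. */
    static int dailyActiveUsers(List<Event> events) {
        Set<String> users = new HashSet<>();
        for (Event e : events) {
            users.add(e.userId);
        }
        return users.size();
    }

    /** Day-over-day change as a fraction of yesterday's value; NaN when yesterday is zero. */
    static double dayOverDayChange(int today, int yesterday) {
        if (yesterday == 0) {
            return Double.NaN;
        }
        return (today - yesterday) / (double) yesterday;
    }

    public static void main(String[] args) {
        List<Event> yesterday = Arrays.asList(new Event("u1", 9), new Event("u2", 9));
        List<Event> today = Arrays.asList(
                new Event("u1", 9), new Event("u1", 9),
                new Event("u2", 10), new Event("u3", 10));

        System.out.println(hourlyActiveUsers(today)); // prints {9=1, 10=2}
        System.out.println(dayOverDayChange(
                dailyActiveUsers(today), dailyActiveUsers(yesterday))); // prints 0.5
    }
}
```

The same shape (group by hour, aggregate, compare against the previous day) would apply to new users, transaction volume, and order count, with the distinct-user count replaced by the relevant aggregate.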

Architectural Design for Big Data Warehouse Project

  • Main Project (gmall): Based on Spring Boot.
  • Dependencies: Spark, Scala, Log4j, Slf4j, Fastjson, Httpclient, Httpmime.
  • Project Structure: Includes parent project, submodules, and dependencies.

Technology Versions:

- Spark: 2.1.1
- Scala: 2.11.8
- Log4j: 1.2.17
- Slf4j: 1.7.22
- Fastjson: 1.2.47
- Httpclient: 4.5.5
- Httpmime: 4.3.6
- Java: 1.8