You need to enable JavaScript to run this app.
导航
DataLeap数据仓库流程最佳实践
最近更新时间:2023.07.18 20:40:31首次发布时间:2023.07.18 20:40:31
前言

本实验以DataLeap on LAS为例,实际操作火山引擎数据产品,完成数据仓库的构建。

关于实验
  • 预计部署时间:50分钟

  • 级别:初级

  • 相关产品:大数据开发套件、湖仓一体分析服务LAS

  • 受众: 通用


环境说明

  1. 已购买DataLeap产品

  2. 已创建湖仓一体LAS队列

  3. 子账户具备DataLeap相关权限(参考:https://www.volcengine.com/docs/6260/65408)


实验说明

步骤1:创建项目





步骤2:计算资源组设置

本案例以湖仓一体Las为例,这里选择已创建的湖仓一体服务

完成上述步骤之后,创建好的DataLeap项目如下:


本Demo中以湖仓一体LAS的样例数据为实验数据
(TPC-DS中的样例表:https://www.volcengine.com/docs/6492/81953)

步骤3:导入样例数据


导入数据四张样例表:

样例中的四张表分别代表:

  • [事实表] Store_Sales: 销售记录表。

  • [维度表] Customers: 客户信息表。

  • [维度表] Stores: 商店信息表。

  • [维度表] Date_Dim: 时间信息表。


基于上述表数据,我们的数据分析需求如下:
1)“查看最近三天商店销售额情况(未促销)TOP3”
2)“查看最近三天消费最多的用户与金额TOP3”
3)“获取商店地域分布情况”

经典数据仓库按照大类分为基础数据层、应用数据层。



本样例中,我们的数据仓库建设思路是:

  • ODS(从生产系统采集原始数据,并将原始数据集成冗余宽表)

  • DWD(对ODS冗余表数据进行轻度过滤处理)

  • DWM (基于DWD表与业务需求,轻度聚合最近三天的数据)

  • APP (基于DWD或DWM,输出具体报表信息)


在“数据地图”中创建数据仓库中要使用到的表:

本案例中库信息为:demo_tpc_ds_2022_11_07_59(请结合具体情况修改)

步骤4: 数据仓库分层建表

ODS(数据聚合宽表)

CREATE TABLE demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df (
    id bigint comment '主键',
    ss_sold_date_sk bigint comment '销售日期',
    ss_sold_time_sk bigint comment '销售时间',
    ss_item_sk bigint comment '销售物品',
    ss_customer_sk bigint comment '客户信息',
    ss_cdemo_sk bigint,
    ss_hdemo_sk bigint,
    ss_addr_sk bigint comment '地址信息',
    ss_store_sk bigint comment '商店信息',
    ss_promo_sk bigint comment '促销信息',
    ss_ticket_number bigint comment '订单号',
    ss_quantity int comment '数量',
    ss_wholesale_cost double comment '整个销售总额',
    ss_list_price double comment '列表价格',
    ss_sales_price double comment '销售价格',
    ss_ext_discount_amt double comment '外部折扣相关信息',
    ss_ext_sales_price double comment '外部销售信息',
    ss_ext_wholesale_cost double comment '外部整个销售总额',
    ss_ext_list_price double comment '外部列表价格',
    ss_ext_tax double comment '外部税相关',
    ss_coupon_amt double comment '打折券',
    ss_net_paid double comment '消费额',
    ss_net_paid_inc_tax double comment '消费税',
    ss_net_profit double comment '盈利',
    c_customer_sk bigint comment '客户信息',
    c_customer_id bigint comment '客户id',
    c_current_cdemo_sk bigint,
    c_current_hdemo_sk bigint,
    c_current_addr_sk bigint comment '地址信息',
    c_first_shipto_date_sk bigint comment '运输时间',
    c_first_sales_date_sk bigint comment '销售时间',
    c_salutation string comment '称呼',
    c_first_name string comment '名',
    c_last_name string comment '姓',
    c_preferred_cust_flag string comment '偏好标记',
    c_birth_day int comment '生日',
    c_birth_month int comment '生日月份',
    c_birth_year int comment '出生年',
    c_birth_country string comment '出生国家',
    c_login string comment '登录信息',
    c_email_address string comment '邮件地址',
    c_last_review_date_sk bigint comment '上次评价日期',
    s_store_sk bigint comment '商店',
    s_store_id string comment '商店id',
    s_rec_start_date date comment '商店开始时间',
    s_rec_end_date date comment '商店结束时间',
    s_closed_date_sk bigint comment '商店关闭时间',
    s_store_name string comment '商店名称',
    s_number_employees int comment '雇员数量',
    s_floor_space int,
    s_hours string comment '小时',
    s_manager string comment '经理',
    s_market_id int comment '市场信息',
    s_geography_class string comment '地理位置信息',
    s_market_desc string comment '市场描述',
    s_market_manager string comment '市场经理',
    s_division_id int,
    s_division_name string,
    s_company_id int comment '公司id',
    s_company_name string comment '公司名称',
    s_street_number string comment '街道编号',
    s_street_name string comment '街道名称',
    s_street_type string comment '街道类型',
    s_suite_number string comment '住宿编号',
    s_city string comment '城市',
    s_county string comment '国家',
    s_state string comment '洲',
    s_zip string comment '邮政编号',
    s_gmt_offset double,
    s_tax_precentage double comment '税收百分比',
    d_date_sk bigint comment '日期',
    d_date_id string comment '日期id',
    d_date date comment '具体date',
    d_month_seq int comment '第几月份',
    d_week_seq int comment '第几周',
    d_quarter_seq int comment '第几季度',
    d_year int comment '年份',
    d_dow int,
    d_moy int comment '月份相关信息',
    d_dom int,
    d_qoy int,
    d_fy_year int comment '年份相关信息',
    d_fy_quarter_seq int comment '季度序列',
    d_fy_week_seq int comment '周数序列',
    d_day_name string comment '显示的天数',
    d_quarter_name string comment '显示的季度',
    d_holiday string comment '节日',
    d_weekend string comment '周末',
    d_following_holiday string comment '后续节日信息',
    d_first_dom int,
    d_last_dom int,
    d_same_day_ly int comment '是否同一天',
    d_same_day_lq int comment '是否同一天',
    d_current_day string comment '是否为当前的天数',
    d_current_week string comment '是否为当前的周数',
    d_current_month string comment '是否为当前的月份',
    d_current_quarter string comment '是否为当前的季度',
    d_current_year string comment '是否为当前的年份'
)

DWD(轻度过滤)

CREATE TABLE demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df(
    id bigint comment '主键',
    ss_sold_date_sk bigint comment '销售日期',
    ss_sold_time_sk bigint comment '销售时间',
    ss_item_sk bigint comment '销售物品',
    ss_customer_sk bigint comment '客户信息',
    ss_cdemo_sk bigint,
    ss_hdemo_sk bigint,
    ss_addr_sk bigint comment '地址信息',
    ss_store_sk bigint comment '商店信息',
    ss_promo_sk bigint comment '促销信息',
    ss_ticket_number bigint comment '订单号',
    ss_quantity int comment '数量',
    ss_wholesale_cost double comment '整个销售总额',
    ss_list_price double comment '列表价格',
    ss_sales_price double comment '销售价格',
    ss_ext_discount_amt double comment '外部折扣相关信息',
    ss_ext_sales_price double comment '外部销售信息',
    ss_ext_wholesale_cost double comment '外部整个销售总额',
    ss_ext_list_price double comment '外部列表价格',
    ss_ext_tax double comment '外部税相关',
    ss_coupon_amt double comment '打折券',
    ss_net_paid double comment '消费额',
    ss_net_paid_inc_tax double comment '消费税',
    ss_net_profit double comment '盈利',
    c_customer_sk bigint comment '客户信息',
    c_customer_id bigint comment '客户id',
    c_current_cdemo_sk bigint,
    c_current_hdemo_sk bigint,
    c_current_addr_sk bigint comment '地址信息',
    c_first_shipto_date_sk bigint comment '运输时间',
    c_first_sales_date_sk bigint comment '销售时间',
    c_salutation string comment '称呼',
    c_first_name string comment '名',
    c_last_name string comment '姓',
    c_preferred_cust_flag string comment '偏好标记',
    c_birth_day int comment '生日',
    c_birth_month int comment '生日月份',
    c_birth_year int comment '出生年',
    c_birth_country string comment '出生国家',
    c_login string comment '登录信息',
    c_email_address string comment '邮件地址',
    c_last_review_date_sk bigint comment '上次评价日期',
    s_store_sk bigint comment '商店',
    s_store_id string comment '商店id',
    s_rec_start_date date comment '商店开始时间',
    s_rec_end_date date comment '商店结束时间',
    s_closed_date_sk bigint comment '商店关闭时间',
    s_store_name string comment '商店名称',
    s_number_employees int comment '雇员数量',
    s_floor_space int,
    s_hours string comment '小时',
    s_manager string comment '经理',
    s_market_id int comment '市场信息',
    s_geography_class string comment '地理位置信息',
    s_market_desc string comment '市场描述',
    s_market_manager string comment '市场经理',
    s_division_id int,
    s_division_name string,
    s_company_id int comment '公司id',
    s_company_name string comment '公司名称',
    s_street_number string comment '街道编号',
    s_street_name string comment '街道名称',
    s_street_type string comment '街道类型',
    s_suite_number string comment '住宿编号',
    s_city string comment '城市',
    s_county string comment '国家',
    s_state string comment '洲',
    s_zip string comment '邮政编号',
    s_gmt_offset double,
    s_tax_precentage double comment '税收百分比',
    d_date_sk bigint comment '日期',
    d_date_id string comment '日期id',
    d_date date comment '具体date',
    d_month_seq int comment '第几月份',
    d_week_seq int comment '第几周',
    d_quarter_seq int comment '第几季度',
    d_year int comment '年份',
    d_dow int,
    d_moy int comment '月份相关信息',
    d_dom int,
    d_qoy int,
    d_fy_year int comment '年份相关信息',
    d_fy_quarter_seq int comment '季度序列',
    d_fy_week_seq int comment '周数序列',
    d_day_name string comment '显示的天数',
    d_quarter_name string comment '显示的季度',
    d_holiday string comment '节日',
    d_weekend string comment '周末',
    d_following_holiday string comment '后续节日信息',
    d_first_dom int,
    d_last_dom int,
    d_same_day_ly int comment '是否同一天',
    d_same_day_lq int comment '是否同一天',
    d_current_day string comment '是否为当前的天数',
    d_current_week string comment '是否为当前的周数',
    d_current_month string comment '是否为当前的月份',
    d_current_quarter string comment '是否为当前的季度',
    d_current_year string comment '是否为当前的年份'
)

DWM(封装业务规则)

CREATE TABLE demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df (
    id bigint comment '主键',
    ss_sold_date_sk bigint comment '销售日期',
    ss_sold_time_sk bigint comment '销售时间',
    ss_item_sk bigint comment '销售物品',
    ss_customer_sk bigint comment '客户信息',
    ss_cdemo_sk bigint,
    ss_hdemo_sk bigint,
    ss_addr_sk bigint comment '地址信息',
    ss_store_sk bigint comment '商店信息',
    ss_promo_sk bigint comment '促销信息',
    ss_ticket_number bigint comment '订单号',
    ss_quantity int comment '数量',
    ss_wholesale_cost double comment '整个销售总额',
    ss_list_price double comment '列表价格',
    ss_sales_price double comment '销售价格',
    ss_ext_discount_amt double comment '外部折扣相关信息',
    ss_ext_sales_price double comment '外部销售信息',
    ss_ext_wholesale_cost double comment '外部整个销售总额',
    ss_ext_list_price double comment '外部列表价格',
    ss_ext_tax double comment '外部税相关',
    ss_coupon_amt double comment '打折券',
    ss_net_paid double comment '消费额',
    ss_net_paid_inc_tax double comment '消费税',
    ss_net_profit double comment '盈利',
    c_customer_sk bigint comment '客户信息',
    c_customer_id bigint comment '客户id',
    c_current_cdemo_sk bigint,
    c_current_hdemo_sk bigint,
    c_current_addr_sk bigint comment '地址信息',
    c_first_shipto_date_sk bigint comment '运输时间',
    c_first_sales_date_sk bigint comment '销售时间',
    c_salutation string comment '称呼',
    c_first_name string comment '名',
    c_last_name string comment '姓',
    c_preferred_cust_flag string comment '偏好标记',
    c_birth_day int comment '生日',
    c_birth_month int comment '生日月份',
    c_birth_year int comment '出生年',
    c_birth_country string comment '出生国家',
    c_login string comment '登录信息',
    c_email_address string comment '邮件地址',
    c_last_review_date_sk bigint comment '上次评价日期'
)

APP(差异化数据服务,提供看板数据)

CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_ticket_top (
    store_sk bigint comment '商店信息',
    wholesale_cost double comment '整个销售总额'
)
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3(
    customer_sk bigint comment '客户信息',
    salutation string comment '称呼',
    first_name string comment '名',
    last_name string comment '姓',
    wholesale_cost double comment '整个销售总额'
)
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution(
    county string comment '国家',
    distribution_count int comment '分布商店总数'
)

步骤5: 数仓任务构建

在任务开发中,新建“数据开发”任务:



ODS层数据任务

INSERT INTO demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df
SELECT  row_number() OVER(
            ORDER BY
                    RAND()
        ) AS id,
        sales.ss_sold_date_sk,
        sales.ss_sold_time_sk,
        sales.ss_item_sk,
        sales.ss_customer_sk,
        sales.ss_cdemo_sk,
        sales.ss_hdemo_sk,
        sales.ss_addr_sk,
        sales.ss_store_sk,
        sales.ss_promo_sk,
        sales.ss_ticket_number,
        sales.ss_quantity,
        sales.ss_wholesale_cost,
        sales.ss_list_price,
        sales.ss_sales_price,
        sales.ss_ext_discount_amt,
        sales.ss_ext_sales_price,
        sales.ss_ext_wholesale_cost,
        sales.ss_ext_list_price,
        sales.ss_ext_tax,
        sales.ss_coupon_amt,
        sales.ss_net_paid,
        sales.ss_net_paid_inc_tax,
        sales.ss_net_profit,
        cust.c_customer_sk,
        cust.c_customer_id,
        cust.c_current_cdemo_sk,
        cust.c_current_hdemo_sk,
        cust.c_current_addr_sk,
        cust.c_first_shipto_date_sk,
        cust.c_first_sales_date_sk,
        cust.c_salutation,
        cust.c_first_name,
        cust.c_last_name,
        cust.c_preferred_cust_flag,
        cust.c_birth_day,
        cust.c_birth_month,
        cust.c_birth_year,
        cust.c_birth_country,
        cust.c_login,
        cust.c_email_address,
        cust.c_last_review_date_sk,
        stores.s_store_sk,
        stores.s_store_id,
        stores.s_rec_start_date,
        stores.s_rec_end_date,
        stores.s_closed_date_sk,
        stores.s_store_name,
        stores.s_number_employees,
        stores.s_floor_space,
        stores.s_hours,
        stores.s_manager,
        stores.s_market_id,
        stores.s_geography_class,
        stores.s_market_desc,
        stores.s_market_manager,
        stores.s_division_id,
        stores.s_division_name,
        stores.s_company_id,
        stores.s_company_name,
        stores.s_street_number,
        stores.s_street_name,
        stores.s_street_type,
        stores.s_suite_number,
        stores.s_city,
        stores.s_county,
        stores.s_state,
        stores.s_zip,
        stores.s_gmt_offset,
        stores.s_tax_precentage,
        dim.d_date_sk,
        dim.d_date_id,
        dim.d_date,
        dim.d_month_seq,
        dim.d_week_seq,
        dim.d_quarter_seq,
        dim.d_year,
        dim.d_dow,
        dim.d_moy,
        dim.d_dom,
        dim.d_qoy,
        dim.d_fy_year,
        dim.d_fy_quarter_seq,
        dim.d_fy_week_seq,
        dim.d_day_name,
        dim.d_quarter_name,
        dim.d_holiday,
        dim.d_weekend,
        dim.d_following_holiday,
        dim.d_first_dom,
        dim.d_last_dom,
        dim.d_same_day_ly,
        dim.d_same_day_lq,
        dim.d_current_day,
        dim.d_current_week,
        dim.d_current_month,
        dim.d_current_quarter,
        dim.d_current_year
FROM    demo_tpc_ds_2022_11_07_59.store_sales sales
JOIN    demo_tpc_ds_2022_11_07_59.customer cust
ON      sales.ss_customer_sk = cust.c_customer_sk
JOIN    demo_tpc_ds_2022_11_07_59.store stores
ON      sales.ss_store_sk = stores.s_store_sk
JOIN    demo_tpc_ds_2022_11_07_59.date_dim dim
ON      sales.ss_sold_date_sk = dim.d_date_sk

DWD层数据任务

INSERT INTO demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df
SELECT  row_number() OVER(
            ORDER BY
                    RAND()
        ) AS id,
        ss_sold_date_sk,
        ss_sold_time_sk,
        ss_item_sk,
        ss_customer_sk,
        ss_cdemo_sk,
        ss_hdemo_sk,
        ss_addr_sk,
        ss_store_sk,
        ss_promo_sk,
        ss_ticket_number,
        ss_quantity,
        ss_wholesale_cost,
        ss_list_price,
        ss_sales_price,
        ss_ext_discount_amt,
        ss_ext_sales_price,
        ss_ext_wholesale_cost,
        ss_ext_list_price,
        ss_ext_tax,
        ss_coupon_amt,
        ss_net_paid,
        ss_net_paid_inc_tax,
        ss_net_profit,
        c_customer_sk,
        c_customer_id,
        c_current_cdemo_sk,
        c_current_hdemo_sk,
        c_current_addr_sk,
        c_first_shipto_date_sk,
        c_first_sales_date_sk,
        c_salutation,
        c_first_name,
        c_last_name,
        c_preferred_cust_flag,
        c_birth_day,
        c_birth_month,
        c_birth_year,
        c_birth_country,
        c_login,
        c_email_address,
        c_last_review_date_sk,
        s_store_sk,
        s_store_id,
        s_rec_start_date,
        s_rec_end_date,
        s_closed_date_sk,
        s_store_name,
        s_number_employees,
        s_floor_space,
        s_hours,
        s_manager,
        s_market_id,
        s_geography_class,
        s_market_desc,
        s_market_manager,
        s_division_id,
        s_division_name,
        s_company_id,
        s_company_name,
        s_street_number,
        s_street_name,
        s_street_type,
        s_suite_number,
        s_city,
        s_county,
        s_state,
        s_zip,
        s_gmt_offset,
        s_tax_precentage,
        d_date_sk,
        d_date_id,
        d_date,
        d_month_seq,
        d_week_seq,
        d_quarter_seq,
        d_year,
        d_dow,
        d_moy,
        d_dom,
        d_qoy,
        d_fy_year,
        d_fy_quarter_seq,
        d_fy_week_seq,
        d_day_name,
        d_quarter_name,
        d_holiday,
        d_weekend,
        d_following_holiday,
        d_first_dom,
        d_last_dom,
        d_same_day_ly,
        d_same_day_lq,
        d_current_day,
        d_current_week,
        d_current_month,
        d_current_quarter,
        d_current_year
FROM    demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df ods
WHERE   ods.ss_sold_date_sk IS NOT NULL
AND     ods.ss_customer_sk IS NOT NULL

DWM层数据任务

在需求一、二中,都有分析近三天数据的需求,因此在DWM层。
基于DWD层,筛选出近三天的数据用于构建DWM层,如果未来有基于近三天的需求(例如:查看最近3天地区购买排行榜),可以基于此分层进一步输出APP层数据报表。

INSERT INTO demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df
SELECT  row_number() OVER(
            ORDER BY
                    RAND()
        ) AS id,
        ss_sold_date_sk,
        ss_sold_time_sk,
        ss_item_sk,
        ss_customer_sk,
        ss_cdemo_sk,
        ss_hdemo_sk,
        ss_addr_sk,
        ss_store_sk,
        ss_promo_sk,
        ss_ticket_number,
        ss_quantity,
        ss_wholesale_cost,
        ss_list_price,
        ss_sales_price,
        ss_ext_discount_amt,
        ss_ext_sales_price,
        ss_ext_wholesale_cost,
        ss_ext_list_price,
        ss_ext_tax,
        ss_coupon_amt,
        ss_net_paid,
        ss_net_paid_inc_tax,
        ss_net_profit,
        c_customer_sk,
        c_customer_id,
        c_current_cdemo_sk,
        c_current_hdemo_sk,
        c_current_addr_sk,
        c_first_shipto_date_sk,
        c_first_sales_date_sk,
        c_salutation,
        c_first_name,
        c_last_name,
        c_preferred_cust_flag,
        c_birth_day,
        c_birth_month,
        c_birth_year,
        c_birth_country,
        c_login,
        c_email_address,
        c_last_review_date_sk
FROM    demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd
WHERE   d_date IN (
            SELECT  DISTINCT d_date
            FROM    demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df
            ORDER BY
                    d_date
            LIMIT   3
        )

APP层数据任务

需求一:

INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_ticket_top
SELECT 
    ss_store_sk as store_sk,
    sum(ss_wholesale_cost) as wholesale_cost
 FROM
     demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm
 WHERE
     dwm.ss_net_profit >0
 GROUP BY ss_store_sk

需求二:

INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3
SELECT
    ss_customer_sk as customer_sk,
    c_salution as salution,
    c_first_name as first_name,
    c_last_name as last_name,
    sum(ss_wholesale_cost) as wholesale_cost
FROM 
    demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm
GROUP BY
    ss_customer_sk,
    c_salution,
    c_first_name,
    c_last_name

需求三:

INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution
SELECT  s_county AS country,
        count(s_store_id) AS distribution_count
FROM    demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd
GROUP BY
        dwd.s_county


步骤6:样例结果检查

需求一:

需求二:

需求三: