本实验以DataLeap on LAS为例,实际操作火山引擎数据产品,完成数据仓库的构建。
预计部署时间:50分钟
级别:初级
相关产品:大数据开发套件、湖仓一体分析服务LAS
受众: 通用
已购买DataLeap产品
已创建湖仓一体LAS队列
子账户具备DataLeap相关权限(参考:https://www.volcengine.com/docs/6260/65408)
本案例以湖仓一体Las为例,这里选择已创建的湖仓一体服务
完成上述步骤之后,创建好的DataLeap项目如下:
本Demo中以湖仓一体LAS的样例数据为实验数据
(TPC-DS中的样例表:https://www.volcengine.com/docs/6492/81953)
导入数据四张样例表:
样例中的四张表分别代表:
[事实表] Store_Sales: 销售记录表。
[维度表] Customers: 客户信息表。
[维度表] Stores: 商店信息表。
[维度表] Date_Dim: 时间信息表。
基于上述表数据,我们的数据分析需求如下:
1)“查看最近三天商店销售额情况(未促销)TOP3”
2)“查看最近三天消费最多的用户与金额TOP3”
3)“获取商店地域分布情况”
经典数据仓库按照大类分为基础数据层、应用数据层。
本样例中,我们的数据仓库建设思路是:
ODS(从生产系统采集原始数据,并将原始数据集成冗余宽表)
DWD(对ODS冗余表数据进行轻度过滤处理)
DWM (基于DWD表与业务需求,轻度聚合最近三天的数据)
APP (基于DWD或DWM,输出具体报表信息)
在“数据地图”中创建数据仓库中要使用到的表:
本案例中库信息为:demo_tpc_ds_2022_11_07_59(请结合具体情况修改)
CREATE TABLE demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df ( id bigint comment '主键', ss_sold_date_sk bigint comment '销售日期', ss_sold_time_sk bigint comment '销售时间', ss_item_sk bigint comment '销售物品', ss_customer_sk bigint comment '客户信息', ss_cdemo_sk bigint, ss_hdemo_sk bigint, ss_addr_sk bigint comment '地址信息', ss_store_sk bigint comment '商店信息', ss_promo_sk bigint comment '促销信息', ss_ticket_number bigint comment '订单号', ss_quantity int comment '数量', ss_wholesale_cost double comment '整个销售总额', ss_list_price double comment '列表价格', ss_sales_price double comment '销售价格', ss_ext_discount_amt double comment '外部折扣相关信息', ss_ext_sales_price double comment '外部销售信息', ss_ext_wholesale_cost double comment '外部整个销售总额', ss_ext_list_price double comment '外部列表价格', ss_ext_tax double comment '外部税相关', ss_coupon_amt double comment '打折券', ss_net_paid double comment '消费额', ss_net_paid_inc_tax double comment '消费税', ss_net_profit double comment '盈利', c_customer_sk bigint comment '客户信息', c_customer_id bigint comment '客户id', c_current_cdemo_sk bigint, c_current_hdemo_sk bigint, c_current_addr_sk bigint comment '地址信息', c_first_shipto_date_sk bigint comment '运输时间', c_first_sales_date_sk bigint comment '销售时间', c_salutation string comment '称呼', c_first_name string comment '名', c_last_name string comment '姓', c_preferred_cust_flag string comment '偏好标记', c_birth_day int comment '生日', c_birth_month int comment '生日月份', c_birth_year int comment '出生年', c_birth_country string comment '出生国家', c_login string comment '登录信息', c_email_address string comment '邮件地址', c_last_review_date_sk bigint comment '上次评价日期', s_store_sk bigint comment '商店', s_store_id string comment '商店id', s_rec_start_date date comment '商店开始时间', s_rec_end_date date comment '商店结束时间', s_closed_date_sk bigint comment '商店关闭时间', s_store_name string comment '商店名称', s_number_employees int comment '雇员数量', s_floor_space int, s_hours string comment '小时', s_manager string comment '经理', s_market_id int comment '市场信息', s_geography_class string comment '地理位置信息', s_market_desc string comment '市场描述', s_market_manager string comment '市场经理', s_division_id int, s_division_name string, s_company_id int comment '公司id', s_company_name string comment '公司名称', s_street_number string comment '街道编号', s_street_name string comment '街道名称', s_street_type string comment '街道类型', s_suite_number string comment '住宿编号', s_city string comment '城市', s_county string comment '国家', s_state string comment '洲', s_zip string comment '邮政编号', s_gmt_offset double, s_tax_precentage double comment '税收百分比', d_date_sk bigint comment '日期', d_date_id string comment '日期id', d_date date comment '具体date', d_month_seq int comment '第几月份', d_week_seq int comment '第几周', d_quarter_seq int comment '第几季度', d_year int comment '年份', d_dow int, d_moy int comment '月份相关信息', d_dom int, d_qoy int, d_fy_year int comment '年份相关信息', d_fy_quarter_seq int comment '季度序列', d_fy_week_seq int comment '周数序列', d_day_name string comment '显示的天数', d_quarter_name string comment '显示的季度', d_holiday string comment '节日', d_weekend string comment '周末', d_following_holiday string comment '后续节日信息', d_first_dom int, d_last_dom int, d_same_day_ly int comment '是否同一天', d_same_day_lq int comment '是否同一天', d_current_day string comment '是否为当前的天数', d_current_week string comment '是否为当前的周数', d_current_month string comment '是否为当前的月份', d_current_quarter string comment '是否为当前的季度', d_current_year string comment '是否为当前的年份' )
CREATE TABLE demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df( id bigint comment '主键', ss_sold_date_sk bigint comment '销售日期', ss_sold_time_sk bigint comment '销售时间', ss_item_sk bigint comment '销售物品', ss_customer_sk bigint comment '客户信息', ss_cdemo_sk bigint, ss_hdemo_sk bigint, ss_addr_sk bigint comment '地址信息', ss_store_sk bigint comment '商店信息', ss_promo_sk bigint comment '促销信息', ss_ticket_number bigint comment '订单号', ss_quantity int comment '数量', ss_wholesale_cost double comment '整个销售总额', ss_list_price double comment '列表价格', ss_sales_price double comment '销售价格', ss_ext_discount_amt double comment '外部折扣相关信息', ss_ext_sales_price double comment '外部销售信息', ss_ext_wholesale_cost double comment '外部整个销售总额', ss_ext_list_price double comment '外部列表价格', ss_ext_tax double comment '外部税相关', ss_coupon_amt double comment '打折券', ss_net_paid double comment '消费额', ss_net_paid_inc_tax double comment '消费税', ss_net_profit double comment '盈利', c_customer_sk bigint comment '客户信息', c_customer_id bigint comment '客户id', c_current_cdemo_sk bigint, c_current_hdemo_sk bigint, c_current_addr_sk bigint comment '地址信息', c_first_shipto_date_sk bigint comment '运输时间', c_first_sales_date_sk bigint comment '销售时间', c_salutation string comment '称呼', c_first_name string comment '名', c_last_name string comment '姓', c_preferred_cust_flag string comment '偏好标记', c_birth_day int comment '生日', c_birth_month int comment '生日月份', c_birth_year int comment '出生年', c_birth_country string comment '出生国家', c_login string comment '登录信息', c_email_address string comment '邮件地址', c_last_review_date_sk bigint comment '上次评价日期', s_store_sk bigint comment '商店', s_store_id string comment '商店id', s_rec_start_date date comment '商店开始时间', s_rec_end_date date comment '商店结束时间', s_closed_date_sk bigint comment '商店关闭时间', s_store_name string comment '商店名称', s_number_employees int comment '雇员数量', s_floor_space int, s_hours string comment '小时', s_manager string comment '经理', s_market_id int comment '市场信息', s_geography_class string comment '地理位置信息', s_market_desc string comment '市场描述', s_market_manager string comment '市场经理', s_division_id int, s_division_name string, s_company_id int comment '公司id', s_company_name string comment '公司名称', s_street_number string comment '街道编号', s_street_name string comment '街道名称', s_street_type string comment '街道类型', s_suite_number string comment '住宿编号', s_city string comment '城市', s_county string comment '国家', s_state string comment '洲', s_zip string comment '邮政编号', s_gmt_offset double, s_tax_precentage double comment '税收百分比', d_date_sk bigint comment '日期', d_date_id string comment '日期id', d_date date comment '具体date', d_month_seq int comment '第几月份', d_week_seq int comment '第几周', d_quarter_seq int comment '第几季度', d_year int comment '年份', d_dow int, d_moy int comment '月份相关信息', d_dom int, d_qoy int, d_fy_year int comment '年份相关信息', d_fy_quarter_seq int comment '季度序列', d_fy_week_seq int comment '周数序列', d_day_name string comment '显示的天数', d_quarter_name string comment '显示的季度', d_holiday string comment '节日', d_weekend string comment '周末', d_following_holiday string comment '后续节日信息', d_first_dom int, d_last_dom int, d_same_day_ly int comment '是否同一天', d_same_day_lq int comment '是否同一天', d_current_day string comment '是否为当前的天数', d_current_week string comment '是否为当前的周数', d_current_month string comment '是否为当前的月份', d_current_quarter string comment '是否为当前的季度', d_current_year string comment '是否为当前的年份' )
CREATE TABLE demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df ( id bigint comment '主键', ss_sold_date_sk bigint comment '销售日期', ss_sold_time_sk bigint comment '销售时间', ss_item_sk bigint comment '销售物品', ss_customer_sk bigint comment '客户信息', ss_cdemo_sk bigint, ss_hdemo_sk bigint, ss_addr_sk bigint comment '地址信息', ss_store_sk bigint comment '商店信息', ss_promo_sk bigint comment '促销信息', ss_ticket_number bigint comment '订单号', ss_quantity int comment '数量', ss_wholesale_cost double comment '整个销售总额', ss_list_price double comment '列表价格', ss_sales_price double comment '销售价格', ss_ext_discount_amt double comment '外部折扣相关信息', ss_ext_sales_price double comment '外部销售信息', ss_ext_wholesale_cost double comment '外部整个销售总额', ss_ext_list_price double comment '外部列表价格', ss_ext_tax double comment '外部税相关', ss_coupon_amt double comment '打折券', ss_net_paid double comment '消费额', ss_net_paid_inc_tax double comment '消费税', ss_net_profit double comment '盈利', c_customer_sk bigint comment '客户信息', c_customer_id bigint comment '客户id', c_current_cdemo_sk bigint, c_current_hdemo_sk bigint, c_current_addr_sk bigint comment '地址信息', c_first_shipto_date_sk bigint comment '运输时间', c_first_sales_date_sk bigint comment '销售时间', c_salutation string comment '称呼', c_first_name string comment '名', c_last_name string comment '姓', c_preferred_cust_flag string comment '偏好标记', c_birth_day int comment '生日', c_birth_month int comment '生日月份', c_birth_year int comment '出生年', c_birth_country string comment '出生国家', c_login string comment '登录信息', c_email_address string comment '邮件地址', c_last_review_date_sk bigint comment '上次评价日期' )
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_ticket_top ( store_sk bigint comment '商店信息', wholesale_cost double comment '整个销售总额' )
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3( customer_sk bigint comment '客户信息', salutation string comment '称呼', first_name string comment '名', last_name string comment '姓', wholesale_cost double comment '整个销售总额' )
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution( county string comment '国家', distribution_count int comment '分布商店总数' )
在任务开发中,新建“数据开发”任务:
INSERT INTO demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df SELECT row_number() OVER( ORDER BY RAND() ) AS id, sales.ss_sold_date_sk, sales.ss_sold_time_sk, sales.ss_item_sk, sales.ss_customer_sk, sales.ss_cdemo_sk, sales.ss_hdemo_sk, sales.ss_addr_sk, sales.ss_store_sk, sales.ss_promo_sk, sales.ss_ticket_number, sales.ss_quantity, sales.ss_wholesale_cost, sales.ss_list_price, sales.ss_sales_price, sales.ss_ext_discount_amt, sales.ss_ext_sales_price, sales.ss_ext_wholesale_cost, sales.ss_ext_list_price, sales.ss_ext_tax, sales.ss_coupon_amt, sales.ss_net_paid, sales.ss_net_paid_inc_tax, sales.ss_net_profit, cust.c_customer_sk, cust.c_customer_id, cust.c_current_cdemo_sk, cust.c_current_hdemo_sk, cust.c_current_addr_sk, cust.c_first_shipto_date_sk, cust.c_first_sales_date_sk, cust.c_salutation, cust.c_first_name, cust.c_last_name, cust.c_preferred_cust_flag, cust.c_birth_day, cust.c_birth_month, cust.c_birth_year, cust.c_birth_country, cust.c_login, cust.c_email_address, cust.c_last_review_date_sk, stores.s_store_sk, stores.s_store_id, stores.s_rec_start_date, stores.s_rec_end_date, stores.s_closed_date_sk, stores.s_store_name, stores.s_number_employees, stores.s_floor_space, stores.s_hours, stores.s_manager, stores.s_market_id, stores.s_geography_class, stores.s_market_desc, stores.s_market_manager, stores.s_division_id, stores.s_division_name, stores.s_company_id, stores.s_company_name, stores.s_street_number, stores.s_street_name, stores.s_street_type, stores.s_suite_number, stores.s_city, stores.s_county, stores.s_state, stores.s_zip, stores.s_gmt_offset, stores.s_tax_precentage, dim.d_date_sk, dim.d_date_id, dim.d_date, dim.d_month_seq, dim.d_week_seq, dim.d_quarter_seq, dim.d_year, dim.d_dow, dim.d_moy, dim.d_dom, dim.d_qoy, dim.d_fy_year, dim.d_fy_quarter_seq, dim.d_fy_week_seq, dim.d_day_name, dim.d_quarter_name, dim.d_holiday, dim.d_weekend, dim.d_following_holiday, dim.d_first_dom, dim.d_last_dom, dim.d_same_day_ly, dim.d_same_day_lq, dim.d_current_day, dim.d_current_week, dim.d_current_month, dim.d_current_quarter, dim.d_current_year FROM demo_tpc_ds_2022_11_07_59.store_sales sales JOIN demo_tpc_ds_2022_11_07_59.customer cust ON sales.ss_customer_sk = cust.c_customer_sk JOIN demo_tpc_ds_2022_11_07_59.store stores ON sales.ss_store_sk = stores.s_store_sk JOIN demo_tpc_ds_2022_11_07_59.date_dim dim ON sales.ss_sold_date_sk = dim.d_date_sk
INSERT INTO demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df SELECT row_number() OVER( ORDER BY RAND() ) AS id, ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk, s_store_sk, s_store_id, s_rec_start_date, s_rec_end_date, s_closed_date_sk, s_store_name, s_number_employees, s_floor_space, s_hours, s_manager, s_market_id, s_geography_class, s_market_desc, s_market_manager, s_division_id, s_division_name, s_company_id, s_company_name, s_street_number, s_street_name, s_street_type, s_suite_number, s_city, s_county, s_state, s_zip, s_gmt_offset, s_tax_precentage, d_date_sk, d_date_id, d_date, d_month_seq, d_week_seq, d_quarter_seq, d_year, d_dow, d_moy, d_dom, d_qoy, d_fy_year, d_fy_quarter_seq, d_fy_week_seq, d_day_name, d_quarter_name, d_holiday, d_weekend, d_following_holiday, d_first_dom, d_last_dom, d_same_day_ly, d_same_day_lq, d_current_day, d_current_week, d_current_month, d_current_quarter, d_current_year FROM demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df ods WHERE ods.ss_sold_date_sk IS NOT NULL AND ods.ss_customer_sk IS NOT NULL
在需求一、二中,都有分析近三天数据的需求,因此在DWM层。
基于DWD层,筛选出近三天的数据用于构建DWM层,如果未来有基于近三天的需求(例如:查看最近3天地区购买排行榜),可以基于此分层进一步输出APP层数据报表。
INSERT INTO demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df SELECT row_number() OVER( ORDER BY RAND() ) AS id, ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd WHERE d_date IN ( SELECT DISTINCT d_date FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df ORDER BY d_date LIMIT 3 )
需求一:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_ticket_top SELECT ss_store_sk as store_sk, sum(ss_wholesale_cost) as wholesale_cost FROM demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm WHERE dwm.ss_net_profit >0 GROUP BY ss_store_sk
需求二:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3 SELECT ss_customer_sk as customer_sk, c_salution as salution, c_first_name as first_name, c_last_name as last_name, sum(ss_wholesale_cost) as wholesale_cost FROM demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm GROUP BY ss_customer_sk, c_salution, c_first_name, c_last_name
需求三:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution SELECT s_county AS country, count(s_store_id) AS distribution_count FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd GROUP BY dwd.s_county
需求一:
需求二:
需求三: