KDD标准流程

ACA课程学习笔记(11)

CRISP-DM(跨行业数据挖掘标准流程)

Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.

  • 过程描述
    • 业务理解-Business Understanding
      • 理解项目目标
      • 理解业务需求
      • 转化为技术需求
    • 数据理解-Data Understanding
      • 数据搜集
      • 熟悉数据
      • 识别质量
      • 了解大数据属性
    • 数据准备-Data Preparation
      • 构造输入数据集(选择,转换,清洗等步骤)
    • 数据建模-Modeling
      • 选择、应用不同模型、算法
      • 调参
    • 模型评估-Evaluation
      • 确定能完成业务目标
    • 方案实施-Deployment

SEMMA

SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by SAS Institute, one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is “rather a logical organization of the functional tool set of” one of their products, SAS Enterprise Miner, “for carrying out the core tasks of data mining”.

  • 过程描述
    • 数据采样-Sample
    使用合适的采样方法从数据中采样,注意数据质量    - 数据探索-**Explore**

    通过探索式数据分析、可视化等技术发现数据的特征、及相关关系    - 问题明确,数据调整,技术选择-**Modify**

    将问题明确,调整所需数据集,明确技术手段    - 模型研发-**Model**

    选择合适算法、模型,调参    - 模型评价-**Assess**

    评估模型效果,对模型进行针对业务的解释和应用

实际实施的八步法

业务理解->指标设计->数据提取->数据探索->算法选择->模型评估->模型发布->模型优化

本文总字数: 1115