AI

假设空间

西瓜书第一章

完整源码见:CQU-AI/Watermelon-book-puzzles

假设空间-样本空间-版本空间自动生成器

生成器的结构

class DatasetSpace:
    """
    Automatically generate the sample_space, hypothesis_space and version_space for a data set
    """

    def __init__(self, dataset):
        """
        :param dataset: pd.DataFrame
        """

    def get_hypothesis_space(self):
        """
        :return: list (list hypothesis (unknown sample_value))
        """

    def get_sample_space(self):
        """
        :return: list (list sample (unknown sample_value))
        """

    def get_version_space(self):
        """
        Check the hypothesis_space with the samples, which will get version_space
        :return:
        """

    def get_possible_values(self):
        """
        Get the possible sample values
        :return: dic (key:(str feature)
                      value:(list possible_feature_values))
        """

    @property
    def hypothesis(self):

    @property
    def version(self):
       
    @property
    def sample(self):

代码结构很简单,但为了适应其他大小的数据集,枚举求样本空间的时候比较复杂,需要递归,经过大佬提醒,加入了一个迭代器(实质上是一个各位进制不同的数):

class Enumerator:
    """
    enumerator for hypothesis_space
    """

    def __init__(self, max):
        self.code = [0] * len(max)
        self.base = max
        self.overflow = False

    def next(self):
        if self.overflow:
            return -1

        self.code[0] += 1
        if not self.carry():
            self.overflow = True
            return self.code
        else:
            return self.code

    def carry(self):
        for i, b in enumerate(self.base):
            if self.code[i] == b:
                self.code[i] = 0
                if i + 1 >= len(self.code):
                    return False
                self.code[i + 1] += 1

                self.carry()
            else:
                return True

它能很好的对极小型数据集进行分析,进而自动生成假设空间-样本空间-版本空间,测试:

melon_data = pd.DataFrame(
        [
            ["青绿", "蜷缩", "浊响", True],
            ["乌黑", "蜷缩", "浊响", True],
            ["青绿", "硬挺", "清脆", False],
            ["乌黑", "稍蜷", "沉闷", False],
        ],
        columns=["色泽", "根蒂", "敲声", "target"],
    )

space = DatasetSpace(melon_data)

print("=" * 79 + "\n样本空间: \n" + space.sample)
print("=" * 79 + "\n合取假设的假设空间: \n" + space.hypothesis)
print("=" * 79 + "\n合取假设的版本空间: \n" + space.version)

析合范式长度与假设空间大小的关系计算

与使用单个合取式来进行假设表示相比,使用“析合范式”将使得假设空间具有更强的表示能力。若使用最多包含k个合取式的析合范式来表达1.1的西瓜分类问题的假设空间,试估算有多少种可能的假设。

思路:

  1. 对每一个假设编码出一个18位的二进制编号,其表征了该假设对18种可能输入的输出,因而任一个假设都拥有一个独特的二进制编号.
  2. k=1时的49个假设由1-1中的程序可以算出,再经过编码获得k=1的49个假设编号,使用这些假设编号进行互相析取,获得k=2的编号.
  3. 对获得的编号进行冗余检查,去掉编号相同的假设,即得到k=2的有效假设编号
  4. 重复将其与k=1的49个假设编号互相析取,直至18位的二进制编号被全部占用,即找到了全部假设.

代码结构:

class UnionHypothesisSpace:
    """
    Figure out how the number of hypothsis change with the disjunction become longer
    """

    def __init__(self, dataset=None):

    def get_conj_hypothesis(self):

    def get_sample_space(self):

    def encode(self, hypothesis):
        """
        Encode a hypothsis to a unique code
        :param hypothesis: list
        :return: int the unique code
        """

    def union(self, k):
        """
        Try union the hypothsis and skip redundant ones
        :param k:
        :return:
        """

    def run(self, k=1):
        """
        runner
        :param k: int the k to start
        :return: None
        """

    def load(self):
        """
        load result from res.json
        :return:
        """

    def save_report(self, k):
        """
        save result to res.json and report
        :param k: int current k
        :return:  None
        """

    def plot(self, k):
        """
        plot k-number of hypothsis
        :param k: int max number of disjunction term
        :return: None
        """

测试:

melon_data = pd.DataFrame(
    [
        ["青绿", "蜷缩", "浊响", True],
        ["乌黑", "蜷缩", "浊响", True],
        ["青绿", "硬挺", "清脆", False],
        ["乌黑", "稍蜷", "沉闷", False],
    ],
    columns=["色泽", "根蒂", "敲声", "target"],
)

space = UnionHypothesisSpace(melon_data)
space.run()

# when you have to stop & continue
# space = UnionHypothesisSpace()
# space.load()
# space.run(10)
本文总字数: 2815