Fun

Fixing article formatting with regular expressions

python-re

Problem

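In the repository for the Sklearn Chinese documentation, a large number of formulas were marked up as Python code blocks, so the web pages ended up looking like this: bug.png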

Solution

A script that strips out all the wrong markers is enough:

import re
import os


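def remove_mark(matched):
    # Callback for re.sub: takes one mis-marked block, returns the cleaned text.
    global md_file, flag
    flag = False  # a block was found, so the file has to be rewritten
    # Remove only the fence markers (with or without the "py" tag);
    # any "py" inside the block itself, e.g. in an image path, is left alone.
    removed_letter = re.sub(r'```(?:py)?\n?', "", matched.group())
    # Collapse runs of newlines into a single newline.
    removed_letter = re.sub(r'\n+', "\n", removed_letter)
    # Append the cleaned block to the log for manual review
    # (assumes the docs are UTF-8 encoded).
    with open("./docs/log.txt", 'a', encoding='utf-8') as log:
        log.write('=' * 36 + md_file + removed_letter)
    return removed_letter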


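if __name__ == '__main__':
    for md_file in os.listdir('./docs'):
        # Only Markdown files are processed.
        if md_file.split('.')[-1] == 'md':
            # Assumes the docs are UTF-8 encoded.
            with open('./docs/'+md_file, 'r+', encoding='utf-8') as f:
                flag = True  # stays True as long as no mis-marked block is found
                ori_md = f.read()
                # Opening py fence, newlines, "!", then lazily up to the next fence.
                fix_md = re.sub('```py\n+!(.|\n)+?```', remove_mark, ori_md)
                if not flag:
                    # Empty the file, then write the corrected text back.
                    f.seek(0)
                    f.truncate()
                    f.write(fix_md)
                    print('DONE at', md_file)
                else:
                    print('No job to do at', md_file)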
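
To see what the pattern catches before touching any files, it can be tried on an in-memory string. What follows is a minimal sketch and not part of the script itself: `FENCE`, `BLOCK`, `strip_fence` and the sample text are hypothetical names and data made up for illustration, and the fence is built by string multiplication only so that the example contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks

# Same idea as the script's pattern: an opening py fence, one or more
# newlines, "!", then lazily everything up to the next fence.
BLOCK = re.compile(FENCE + r"py\n+!(.|\n)+?" + FENCE)


def strip_fence(matched):
    # Drop the fence markers, then collapse repeated newlines.
    text = re.sub(FENCE + r"(?:py)?\n?", "", matched.group())
    return re.sub(r"\n+", "\n", text)


# Hypothetical sample: one formula image wrongly fenced as Python,
# followed by one genuine Python code block.
sample = (
    "Some text.\n\n"
    + FENCE + "py\n\n![formula](img/numpy_eq.png)\n\n" + FENCE + "\n\n"
    + FENCE + "py\nprint('hello')\n" + FENCE + "\n"
)

fixed = BLOCK.sub(strip_fence, sample)
print(fixed)
```

In this sample only the block whose first non-blank character is "!" loses its fence; the genuine code block does not match and is left as it is.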

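The script does the following:

  1. Iterate over all Markdown files.
  2. Look for mis-marked blocks matching "```py\n+!(.|\n)+?```"; if none is found, leave the file unchanged and move on to the next one.
  3. Fix each mis-marked block with the remove_mark function (remove the markers, tidy up the newlines).
  4. Write the fixed blocks to log.txt so they can be reviewed quickly by hand.
  5. Write the adjusted text back into the (already emptied) file.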
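
The same steps can also be sketched without the global flag, assuming `re.subn` is used to get the number of replacements back directly. This is a variation for illustration, not the script above: `fix_docs`, `clean_block`, `BLOCK` and `FENCE` are made-up names, `[\s\S]` stands in for `(.|\n)`, and the log separator is slightly different.

```python
import re
import tempfile
from pathlib import Path

FENCE = "`" * 3  # three backticks

# Opening py fence, newlines, "!", then lazily any characters up to the next fence.
BLOCK = re.compile(FENCE + r"py\n+![\s\S]+?" + FENCE)


def clean_block(matched):
    """Remove the fence markers from one block and tidy its newlines."""
    text = re.sub(FENCE + r"(?:py)?\n?", "", matched.group())
    return re.sub(r"\n+", "\n", text)


def fix_docs(docs_dir):
    """Fix every .md file in docs_dir and return {file name: blocks fixed}."""
    docs = Path(docs_dir)
    report = {}
    for path in sorted(docs.glob("*.md")):
        original = path.read_text(encoding="utf-8")
        # subn returns the new text together with the replacement count.
        fixed, count = BLOCK.subn(clean_block, original)
        report[path.name] = count
        if count == 0:
            continue  # nothing to fix, so the file is not rewritten
        path.write_text(fixed, encoding="utf-8")
        # Log every cleaned block for manual review.
        with open(docs / "log.txt", "a", encoding="utf-8") as log:
            for m in BLOCK.finditer(original):
                log.write("=" * 36 + " " + path.name + "\n" + clean_block(m))
    return report


if __name__ == "__main__":
    # Try it on a throwaway directory with one bad and one clean file.
    with tempfile.TemporaryDirectory() as tmp:
        bad = "Intro\n" + FENCE + "py\n![f](img/a.png)\n" + FENCE + "\nEnd\n"
        Path(tmp, "bad.md").write_text(bad, encoding="utf-8")
        Path(tmp, "good.md").write_text("Nothing to fix\n", encoding="utf-8")
        print(fix_docs(tmp))
```

Returning a per-file count makes it easy to do a dry run on a copy of the docs first and compare the report with log.txt before running it on the real repository.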