Automatically Scraping LeetCode Problems and Saving Them as Markdown

Last updated: the evening of January 9, 2021

Copy, paste, run

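Preface

When working through LeetCode problems I like to write my solutions locally, but copying every problem statement by hand is tedious, and the copied text isn't in Markdown anyway. So I wondered whether I could scrape the content and convert the format automatically.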

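Main text

I'm not very familiar with crawlers (mainly with capturing and inspecting network traffic), so rather than write everything from scratch I searched around and found a blog post that does exactly this. It looked good, so I simply copied and pasted from it. What I borrowed is mainly the format conversion and the content-fetching method.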

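How it works

All of LeetCode's requests use GraphQL. Compared with a RESTful API it is more flexible: it can cut the number of requests (one query can fetch several pieces of data), and a query can ask for only the fields it needs.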
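To illustrate the "ask for only the fields it needs" point, here is a minimal sketch that builds a GraphQL request body selecting just three fields. The helper `build_payload` and the operation name `questionTitle` are made up for illustration; the field names are taken from the captured request shown later in this post.

```python
import json


def build_payload(slug, fields):
    """Build a GraphQL request body that selects only the given fields.

    `build_payload` is a hypothetical helper, not part of any library.
    """
    selection = "\n    ".join(fields)
    query = (
        "query questionTitle($titleSlug: String!) {\n"
        "  question(titleSlug: $titleSlug) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
    return {
        # operationName must match the name used in the query text.
        "operationName": "questionTitle",
        "variables": {"titleSlug": slug},
        "query": query,
    }


payload = build_payload("two-sum", ["questionFrontendId", "title", "difficulty"])
print(json.dumps(payload, indent=2))
```

Adding or removing a name in the `fields` list changes what the server sends back, without touching the endpoint.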

For this example, open the developer tools with F12, switch to the Network tab, filter for graphql requests, and pick out the right one:

curl 'https://leetcode-cn.com/graphql/' \
  -H 'authority: leetcode-cn.com' \
  -H 'x-timezone: undefined' \
  -H 'x-operation-name: questionData' \
  -H 'accept-language: zh-CN' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
  -H 'content-type: application/json' \
  -H 'accept: */*' \
  -H 'x-csrftoken: balabalabala' \
  -H 'dnt: 1' \
  -H 'x-definition-name: question' \
  -H 'origin: https://leetcode-cn.com' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://leetcode-cn.com/problems/best-time-to-buy-and-sell-stock-iii/' \
  -H 'cookie: gbalabalabala' \
  --data-binary $'{"operationName":"questionData","variables":{"titleSlug":"best-time-to-buy-and-sell-stock-iii"},"query":"query questionData($titleSlug: String\u0021) {\\n  question(titleSlug: $titleSlug) {\\n    questionId\\n    questionFrontendId\\n    boundTopicId\\n    title\\n    titleSlug\\n    content\\n    translatedTitle\\n    translatedContent\\n    isPaidOnly\\n    difficulty\\n    likes\\n    dislikes\\n    isLiked\\n    similarQuestions\\n    contributors {\\n      username\\n      profileUrl\\n      avatarUrl\\n      __typename\\n    }\\n    langToValidPlayground\\n    topicTags {\\n      name\\n      slug\\n      translatedName\\n      __typename\\n    }\\n    companyTagStats\\n    codeSnippets {\\n      lang\\n      langSlug\\n      code\\n      __typename\\n    }\\n    stats\\n    hints\\n    solution {\\n      id\\n      canSeeDetail\\n      __typename\\n    }\\n    status\\n    sampleTestCase\\n    metaData\\n    judgerAvailable\\n    judgeType\\n    mysqlSchemas\\n    enableRunCode\\n    envInfo\\n    book {\\n      id\\n      bookName\\n      pressName\\n      source\\n      shortDescription\\n      fullDescription\\n      bookImgUrl\\n      pressImgUrl\\n      productUrl\\n      __typename\\n    }\\n    isSubscribed\\n    isDailyQuestion\\n    dailyRecordStatus\\n    editorType\\n    ugcQuestionId\\n    style\\n    __typename\\n  }\\n}\\n"}' \
  --compressed

As you can see, the request asks for quite a lot of fields, but in this form it is hard on the eyes, so let's reformat it.

{
  operationName: "questionData"
  query: "query questionData($titleSlug: String!) {
  question(titleSlug: $titleSlug) {
    questionId
    questionFrontendId
    boundTopicId
    title
    titleSlug
    content
    translatedTitle
    translatedContent
    isPaidOnly
    difficulty
    likes
    dislikes
    isLiked
    similarQuestions
    contributors {
      username
      profileUrl
      avatarUrl
      __typename
    }
    langToValidPlayground
    topicTags {
      name
      slug
      translatedName
      __typename
    }
    companyTagStats
    codeSnippets {
      lang
      langSlug
      code
      __typename
    }
    stats
    hints
    solution {
      id
      canSeeDetail
      __typename
    }
    status
    sampleTestCase
    metaData
    judgerAvailable
    judgeType
    mysqlSchemas
    enableRunCode
    envInfo
    book {
      id
      bookName
      pressName
      source
      shortDescription
      fullDescription
      bookImgUrl
      pressImgUrl
      productUrl
      __typename
    }
    isSubscribed
    isDailyQuestion
    dailyRecordStatus
    editorType
    ugcQuestionId
    __typename
  }
}
"
  variables: {titleSlug: "merge-two-sorted-lists"}
}
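The readable view above can also be reproduced outside the browser. A minimal sketch, assuming the request body has been copied out as a JSON string (the `raw_body` here is a shortened stand-in, not the full captured body): parsing it with `json.loads` turns the escaped `\n` sequences back into real line breaks.

```python
import json

# Shortened stand-in for the body copied from the Network tab (assumption:
# the real body has the same three top-level keys, just with more fields).
raw_body = (
    '{"operationName":"questionData",'
    '"variables":{"titleSlug":"merge-two-sorted-lists"},'
    '"query":"query questionData($titleSlug: String!) {\\n'
    '  question(titleSlug: $titleSlug) {\\n    title\\n    content\\n  }\\n}\\n"}'
)

body = json.loads(raw_body)
print("operationName:", body["operationName"])
print("variables:", body["variables"])
print(body["query"])  # the escaped \n sequences are now real newlines
```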

Translated into Python, it looks roughly like this:

import json

import requests


def get_all(slug):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
    session = requests.Session()
    url = "https://leetcode-cn.com/graphql"
    params = {
        'operationName':
        "getQuestionDetail",
        'variables': {
            'titleSlug': slug
        },
        'query':
        '''query getQuestionDetail($titleSlug: String!) {
            question(titleSlug: $titleSlug) {
                questionId
                questionFrontendId
                title
                titleSlug
                content
                translatedTitle
                translatedContent
                difficulty
                topicTags {
                    name
                    slug
                    translatedName
                    __typename
                }
                codeSnippets {
                    lang
                    langSlug
                    code
                    __typename
                }
                __typename
            }
        }'''
    }
    json_data = json.dumps(params).encode('utf8')
    headers = {
        'User-Agent': user_agent,
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Referer': 'https://leetcode-cn.com/problems/' + slug
    }
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    resp.encoding = 'utf8'
    content = resp.json()
    # Problem details
    # print(content)
    question = content['data']['question']
    return question

Serializing the parameters as JSON is all the request needs. The response is then parsed as JSON, and the fields can be read directly from data → question.
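As a sketch of that last step, the snippet below pulls `data.question` out of a parsed response and renders a bare-bones Markdown page. It is not the code from the Gist: `extract_question` and `to_markdown` are hypothetical helpers, the sample response is hand-written, and the problem statement is embedded as raw HTML (which most Markdown renderers accept) rather than converted.

```python
def extract_question(content):
    """Return content['data']['question'], raising if the server reported errors."""
    # GraphQL servers report failures in a top-level "errors" list.
    if content.get("errors"):
        raise ValueError(f"GraphQL errors: {content['errors']}")
    question = (content.get("data") or {}).get("question")
    if question is None:
        raise ValueError("no question in response (wrong titleSlug?)")
    return question


def to_markdown(question):
    """Render a minimal Markdown page; the HTML body is kept as-is."""
    # Prefer the translated fields when present, else fall back to the originals.
    title = question.get("translatedTitle") or question["title"]
    body = question.get("translatedContent") or question.get("content") or ""
    tags = ", ".join(t["name"] for t in question.get("topicTags") or [])
    lines = [
        f"# {question['questionFrontendId']}. {title}",
        "",
        f"- Difficulty: {question['difficulty']}",
        f"- Tags: {tags}",
        "",
        body,
        "",
    ]
    return "\n".join(lines)


# Hand-written sample in the shape returned by the query above (placeholder text).
sample = {
    "data": {
        "question": {
            "questionFrontendId": "21",
            "title": "Merge Two Sorted Lists",
            "translatedTitle": None,
            "content": "<p>Merge two sorted linked lists.</p>",
            "translatedContent": None,
            "difficulty": "Easy",
            "topicTags": [{"name": "Linked List"}, {"name": "Recursion"}],
        }
    }
}

markdown = to_markdown(extract_question(sample))
print(markdown)
```

For cleaner output, the raw HTML body could be passed through an HTML-to-Markdown converter before it is written to a file.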

Code

Just copy, paste, and run. Oh, and remember to change the url.
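If the `url` in question is the problem page address (an assumption, since the script itself is only linked and not shown here), the `titleSlug` that the query needs is the path segment right after `/problems/`. A small sketch with a hypothetical helper `slug_from_url`:

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Extract the titleSlug from a problem URL like .../problems/<slug>/."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "problems":
        raise ValueError(f"not a problem URL: {url}")
    return parts[1]


print(slug_from_url(
    "https://leetcode-cn.com/problems/best-time-to-buy-and-sell-stock-iii/"
))
```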

Here is a GitHub Gist
If the GitHub Gist won't load, see the copy on Gitee

References

Fetching LeetCode problem information with a crawler and converting it to Markdown
Scraping LeetCode problems: how to send a GraphQL query to fetch data


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!