Notes on Parsing Baidu Wenku Documents | Baidu Wenku, download, crawler | 二毛erma0 Blog

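Background

Last time I wanted to download a document, I tried a whole round of Baidu Wenku downloaders, and none of them worked anymore.

That includes all kinds of desktop software, browser extensions, and Tampermonkey scripts; every one of them had stopped working.

With no other option, I fell back on copying as a stopgap (select the content, then click "Translate") to get the text.

Later, when I had some free time, I decided to see whether I could parse and download the document myself.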

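Process

I searched online first. There are open-source projects and write-ups analyzing the site, but the two I tried no longer worked, so I did not read them closely.

Getting hands-on, I found that requesting the document link directly returns a page that already contains everything needed, including the links to the text and the images.

I could not figure out how the image links are meant to be parsed. It looks as if the server merges several images into one and the front end splits them apart again.

The text content comes back as JSON. A web search turned up the fabled bdjson, Baidu's custom JSON format, which is said to be the same format used by xreader (a reading app on the PSP). A quick look did not turn up any way to convert xreader to doc, so I gave up on direct conversion and decided to just extract the text, then download all the images and fix the formatting by hand.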
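
To make the "just take the text" idea concrete, here is a minimal sketch that runs on hand-made data. The field names `t`, `c`, and `p.y` follow the script further down; the `body_to_text` helper and the sample items are illustrative assumptions, not real Wenku output.

```python
def body_to_text(body):
    """Join 'word' items into plain text.

    A lone space whose y coordinate is smaller than that of the next item
    is treated as a line break (the same heuristic the script below uses).
    """
    parts = []
    for index, item in enumerate(body):
        if item.get('t') != 'word':
            continue  # skip pictures and any other item type
        text = item['c']
        if text == ' ' and index < len(body) - 1:
            if item['p']['y'] < body[index + 1]['p']['y']:
                text = '\n'
        parts.append(text)
    return ''.join(parts)


# Hand-made sample, not real Wenku data.
sample_body = [
    {'t': 'word', 'c': 'Hello', 'p': {'y': 10}},
    {'t': 'word', 'c': ' ', 'p': {'y': 10}},
    {'t': 'word', 'c': 'world', 'p': {'y': 10}},
    {'t': 'word', 'c': ' ', 'p': {'y': 10}},
    {'t': 'word', 'c': 'Next', 'p': {'y': 30}},
    {'t': 'pic', 'c': {}, 'p': {'y': 50}},
]

print(repr(body_to_text(sample_body)))  # 'Hello world\nNext'
```

The first space stays a space because the next item sits on the same line (same `y`); the second becomes a newline because the next item is lower on the page.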

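Often you only want to copy a small passage but find that the full text cannot be previewed. In my tests, clearing the Wenku cookies and then navigating to the document from a Baidu search results page lifts the full-text preview restriction.

Later I found another endpoint that previews the full text directly. Copying works, but printing does not (only 6 pages of content are shown and the rest is blank):

https://wenku.baidu.com/share/<document ID>?share_api=1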
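
As a small illustration, the share URL can be derived from an ordinary `/view/<id>.html` link. This is only a sketch: the `share_url` helper is hypothetical, and it assumes the document ID consists of letters and digits, as in the example links used later in this post.

```python
import re


def share_url(view_url):
    """Build the share-preview URL from a normal /view/<id>.html link."""
    # Assumption: the document ID is purely alphanumeric.
    match = re.search(r'/view/([0-9a-zA-Z]+)\.html', view_url)
    if match is None:
        raise ValueError(f'no document ID found in {view_url!r}')
    return f'https://wenku.baidu.com/share/{match.group(1)}?share_api=1'


print(share_url('https://wenku.baidu.com/view/c145899e6294dd88d1d26b0e.html'))
# https://wenku.baidu.com/share/c145899e6294dd88d1d26b0e?share_api=1
```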

Code for downloading the text and images

import requests
import re, time, os, json


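class Doc(object):
    """
    Download a Baidu Wenku document as TXT
    """
    def __init__(self, url):
        """
        Take the document URL
        """
        # The document ID is the part between 'view/' and '.html'
        self.doc_id = re.findall(r'view/(.*?)\.html', url)[0]
        self.s = requests.session()
        self.get_info()
        self.get_token()
        # Output directory ('下載' means 'downloads')
        os.makedirs('下載', exist_ok=True)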

    @staticmethod
    def get_timestamp(long: int = 13):
        """
        Return the current timestamp as a string, 13 digits (milliseconds) by default
        """
        return str(time.time_ns())[:long]

    def get_info(self):
        """
        Fetch basic info such as the document title and page count
        """
        url = f'https://wenku.baidu.com/share/{self.doc_id}?share_api=1'
        try:
            html = self.s.get(url).content.decode('GBK')
            self.title = re.search(r"title'   : '([\s\S]*?)',", html).group(1)
            pages = re.search(r"totalPageNum' : '([\s\S]*?)',", html).group(1)
            self.pages = int(pages)
            self.type = re.search(r"docType' : '(.*?)',", html).group(1)
            # htmlUrls is an escaped JSON string embedded in the page
            htmlURLs = re.search(r"htmlUrls: '([\s\S]*?)',", html).group(1).replace('\\\\', '')
            htmlURLs = htmlURLs.encode('latin1').decode('unicode_escape')
            self.urls = json.loads(htmlURLs)
        except Exception as exc:
            print(f'Failed to get basic document info, exiting: {exc!r}')
            raise SystemExit(1)

    def get_token(self):
        """
        Fetch the token; it is short-lived (valid for hours)
        """
        url = 'https://wenku.baidu.com/api/interface/gettoken?host=wenku.baidu.com&secretKey=6f614cb00c6b6821e3cdc85ab1f8f907'
        try:
            res = self.s.get(url).json()
            self.token = res['data']['token']
        except Exception as exc:
            print(f'Failed to get token, exiting: {exc!r}')
            raise SystemExit(1)

    def download_pic(self):
        """
        Download all images
        """
        pics = self.urls['png']
        for pic in pics:
            content = self.s.get(pic['pageLoadUrl']).content
            # One PNG per entry, named <title>_<page index>.png
            with open(f'下載/{self.title}_{pic["pageIndex"]}.png', 'wb') as f:
                f.write(content)

    def download(self):
        """
        Download the text page by page through the getcontent API
        """
        self.download_pic()
        print(self.token, self.title, self.pages, self.type)
        result = ''
        url = 'https://wenku.baidu.com/api/interface/getcontent'
        for page in range(1, self.pages + 1):
            params = {
                "doc_id": self.doc_id,
                "pn": page,
                "t": "json",
                "v": "6",
                "host": "wenku.baidu.com",
                "token": self.token,
                "type": "xreader"
            }
            res = self.s.get(url, params=params).json()
            lst = res['data']['1']['body']
            for index in range(len(lst)):
                if lst[index]['t'] == 'word':
                    text = lst[index]['c']
                    # A lone space followed by an item further down the page
                    # (larger y) marks a line break
                    if text == ' ' and index < len(lst) - 1:
                        if lst[index]['p']['y'] < lst[index + 1]['p']['y']:
                            text = '\n'
                        else:
                            text = ' '
                    result += text
                elif lst[index]['t'] == 'pic':
                    # Images are saved separately by download_pic()
                    pass
        with open(f'下載/{self.title}.txt', 'w', encoding='utf-8') as f:
            f.write(result)
        print('Done')

    def download2(self):
        """
        Download the text from the per-page JSON URLs listed in htmlUrls
        """
        self.download_pic()
        print(self.token, self.title, self.pages, self.type)
        result = ''
        pages = self.urls['json']
        for page in pages:
            res = self.s.get(page['pageLoadUrl']).text
            # Strip the wrapper around the JSON: 8 leading characters and 1 trailing character
            res = res[8:-1]
            res = json.loads(res)
            lst = res['body']
            for index in range(len(lst)):
                if lst[index]['t'] == 'word':
                    text = lst[index]['c']
                    # A lone space followed by an item further down the page
                    # (larger y) marks a line break
                    if text == ' ' and index < len(lst) - 1:
                        if lst[index]['p']['y'] < lst[index + 1]['p']['y']:
                            text = '\n'
                        else:
                            text = ' '
                    result += text
                elif lst[index]['t'] == 'pic':
                    # Images are saved separately by download_pic()
                    pass
        with open(f'下載/{self.title}.txt', 'w', encoding='utf-8') as f:
            f.write(result)
        print('Done')


if __name__ == "__main__":

    # url = 'https://wenku.baidu.com/view/81bbd69a541810a6f524ccbff121dd36a22dc477.html'
    url = 'https://wenku.baidu.com/view/c145899e6294dd88d1d26b0e.html'
    doc = Doc(url)
    doc.download()
    # doc.download2()
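
In `download2`, the slice `res[8:-1]` relies on the wrapper around the JSON being exactly 8 characters long at the front. If the response is a JSONP-style `callback(...)` payload (an assumption on my part, and the callback name in the sample below is made up), a slightly more tolerant way to unwrap it might look like this sketch:

```python
import json


def unwrap_jsonp(text):
    """Return the JSON payload of a `callback(...)` style response."""
    start = text.find('(')   # first opening parenthesis
    end = text.rfind(')')    # last closing parenthesis
    if start == -1 or end <= start:
        raise ValueError('response does not look like JSONP')
    return json.loads(text[start + 1:end])


# Made-up response; the callback name is hypothetical.
sample = 'wenku_1({"body": [{"t": "word", "c": "hi"}]})'
print(unwrap_jsonp(sample)['body'][0]['c'])  # prints: hi
```

This keeps working if the callback name changes length, at the cost of assuming the payload is always wrapped in one pair of outer parentheses.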
