1. What Is a Proxy Pool
A proxy pool is a repository that collects a large number of proxy server IP addresses. When we use the Requests library to send HTTP requests to a server, configuring different proxies hides our real IP address.
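As a minimal sketch of what this looks like in Requests (the proxy address below is a placeholder, not a working server), the proxies parameter maps each URL scheme to a proxy URL:

import requests

# Placeholder proxy address; replace with a real proxy server you control
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
# The target server sees the proxy's IP instead of ours
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(response.json())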
2. What Are the Advantages
- Hide your real IP: large-scale crawling from a single address is easy to detect; routing requests through proxy IPs keeps your real IP hidden.
- Avoid IP bans: if one IP gets banned, we can switch to another proxy IP and keep working (see the sketch after this list).
- Distribute request load: with many proxies in the pool, requests can be spread across them instead of hammering the target from a single address.
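The "switch on ban" idea can be sketched as follows. This is a minimal illustration, not a fixed recipe: fetch_with_rotation is a hypothetical helper, and treating HTTP 403 as a ban signal is an assumption that varies by site.

import requests

# Hypothetical pool; replace with proxies you actually control
pool = ['http://10.10.1.10:3128', 'http://10.10.1.11:1080']

def fetch_with_rotation(url):
    """Try each proxy in turn; drop ones that fail or appear banned."""
    for proxy in list(pool):
        try:
            resp = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)
            if resp.status_code == 403:   # assumed ban signal: discard and move on
                pool.remove(proxy)
                continue
            return resp
        except requests.exceptions.RequestException:
            pool.remove(proxy)            # dead proxy: discard and move on
    return None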
3. Installing the requests Library
pip install requests
4. Hands-On Practice
A Simple Proxy Pool Implementation
First we need a source of proxy IPs; the pool can be built from public proxies or private proxies.
Next we write a simple pool handler:
import requests
import random

# Proxy IP pool (adjust these entries to suit your own situation)
proxy_list = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:1080',
    'http://211.135.30.151:3128',
    'http://183.131.76.73:8888',
    'http://116.209.68.130:8080',
]

def get_random_proxy():
    """Randomly select a proxy from the pool"""
    return random.choice(proxy_list)

def fetch_with_proxy(url):
    """Fetch website content using a proxy"""
    try:
        # Set up proxy
        proxy = get_random_proxy()
        proxies = {
            'http': proxy,
            'https': proxy,
        }
        # Send request with headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, proxies=proxies, headers=headers, timeout=5)
        print(f'Status Code: {response.status_code}')
        print(f'Using Proxy: {proxy}')
        return response.text
    except requests.exceptions.RequestException as e:
        print(f'Request failed: {e}')
        return None

if __name__ == '__main__':
    # httpbin.org/ip echoes the caller's IP, handy for verifying the proxy works
    url = 'http://httpbin.org/ip'
    result = fetch_with_proxy(url)
    print(result)
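Note that the addresses in proxy_list above are placeholders; the run will fail with a connection error unless you substitute live proxies you are entitled to use.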
Using Third-Party Proxy Services
Python Requests also works with free or paid third-party proxy services, for example:
- Free Proxy: https://www.proxy-list.download/
- Paid Proxy Services: Bright Data, Oxylabs, etc.
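Paid services usually require authentication. Requests accepts credentials embedded directly in the proxy URL; the host and credentials below are placeholders:

import requests

# Placeholder credentials and host for an authenticated paid proxy
proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)

The example below instead cycles through a small pool round-robin with itertools.cycle: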
import requests
from itertools import cycle

# Proxy pool (rotated round-robin so requests are spread across proxies)
proxy_list = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080',
]
proxy_pool = cycle(proxy_list)

def batch_requests(urls):
    """Fetch URLs one after another, rotating through the proxy pool"""
    for url in urls:
        proxy = next(proxy_pool)
        proxies = {'http': proxy, 'https': proxy}
        try:
            response = requests.get(url, proxies=proxies, timeout=5)
            print(f'Fetched {url} with proxy {proxy}')
        except requests.exceptions.RequestException as e:
            print(f'Error fetching {url}: {e}')

if __name__ == '__main__':
    urls = [
        'http://example.com',
        'http://example.com/page1',
        'http://example.com/page2',
    ]
    batch_requests(urls)
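Fetching one URL at a time leaves the load-distribution benefit on the table. One natural extension (a sketch under the same placeholder proxies, not part of the original recipe) is to fan the loop out over a thread pool with concurrent.futures:

import requests
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

proxy_pool = cycle([
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080',
])

def fetch(url):
    # Calling next() on a shared cycle is adequate for a sketch;
    # under heavy load, a lock or per-thread pool is safer
    proxy = next(proxy_pool)
    try:
        requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)
        return f'Fetched {url} with proxy {proxy}'
    except requests.exceptions.RequestException as e:
        return f'Error fetching {url}: {e}'

if __name__ == '__main__':
    urls = ['http://example.com', 'http://example.com/page1', 'http://example.com/page2']
    with ThreadPoolExecutor(max_workers=3) as executor:
        for line in executor.map(fetch, urls):
            print(line)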
5. Troubleshooting Common Problems
Unavailable Proxies
Many proxies found on public lists are unreliable or dead. We need to handle this case:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    """Build a Session that retries failed requests with exponential backoff"""
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def fetch_with_retry(url, proxies):
    """Fetch a URL through the given proxies, retrying transient failures"""
    try:
        response = requests_retry_session().get(
            url,
            proxies=proxies,
            timeout=5
        )
        return response.text
    except requests.exceptions.RequestException as e:
        print(f'Failed after retries: {e}')
        return None
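A quick usage sketch (the proxy address is again a placeholder):

if __name__ == '__main__':
    # Substitute a live proxy from your own pool
    proxy = 'http://10.10.1.10:3128'
    proxies = {'http': proxy, 'https': proxy}
    print(fetch_with_retry('http://httpbin.org/ip', proxies))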
6. Best Practices
- Check proxy availability on a regular schedule (see the health-check sketch after this list)
- Handle timeouts and connection errors; catch exceptions rather than letting them crash the crawl
- Set a realistic User-Agent so requests are not flagged as coming from a bot
- Respect each site's robots.txt rules and avoid putting excessive load on the site
- Use trusted proxy providers and avoid IP addresses of unknown origin
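A minimal health-check sketch for the first point; check_proxy is a hypothetical helper, and using httpbin.org/ip as the test URL is an assumption, not a fixed standard:

import requests

def check_proxy(proxy, test_url='http://httpbin.org/ip'):
    """Return True if the proxy answers a simple request within the timeout"""
    try:
        resp = requests.get(test_url,
                            proxies={'http': proxy, 'https': proxy},
                            timeout=5)
        return resp.status_code == 200
    except requests.exceptions.RequestException:
        return False

# Periodically rebuild the pool from whatever still responds
proxy_list = ['http://10.10.1.10:3128', 'http://10.10.1.11:1080']
proxy_list = [p for p in proxy_list if check_proxy(p)]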