twisted.web.client.getPage with Basic Auth Proxy Support

by mandel on December 15th, 2011

The following is some code I have been working on (and stupidly wasting time on a small error). It gets a page through a proxy that uses basic auth, using a method similar to twisted.web.client.getPage.

# -*- coding: utf-8 -*-
#
# Copyright 2011 Canonical Ltd.
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License version 3, as published
# by the Free Software Foundation.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranties of
# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
# PURPOSE.  See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not, see <http://www.gnu.org/licenses/>.
"""Get a web page through a proxy that requires basic auth."""
import base64

# only the reactor is needed from twisted.internet
from twisted.internet import reactor
from twisted.web import client, error, http

# ignore common twisted lint errors
# pylint: disable=C0103, W0212
 
 
class ProxyClientFactory(client.HTTPClientFactory):
    """Factory that supports proxy."""
 
    def __init__(self, proxy_url, proxy_port, url, headers=None):
        self.proxy_url = proxy_url
        self.proxy_port = proxy_port
        client.HTTPClientFactory.__init__(self, url, headers=headers)
 
    def setURL(self, url):
        self.host = self.proxy_url
        self.port = self.proxy_port
        self.url = url
        self.path = url
 
 
class ProxyWebClient(object):
    """Provide useful web methods with proxy."""
 
    def __init__(self, proxy_url=None, proxy_port=None, username=None,
            password=None):
        """Create a new instance with the proxy settings."""
        self.proxy_url = proxy_url
        self.proxy_port = proxy_port
        self.username = username
        self.password = password
 
    def _process_auth_error(self, failure, url, contextFactory):
        """Process an auth failure."""
        # we try to get the page using the basic auth
        failure.trap(error.Error)
        if failure.value.status == str(http.PROXY_AUTH_REQUIRED):
            auth = base64.b64encode('%s:%s' % (self.username, self.password))
            auth_header = 'Basic ' + auth.strip()
            factory = ProxyClientFactory(self.proxy_url, self.proxy_port, url,
                    headers={'Proxy-Authorization': auth_header})
            # pylint: disable=E1101
            reactor.connectTCP(self.proxy_url, self.proxy_port, factory)
            # pylint: enable=E1101
            return factory.deferred
        else:
            return failure
 
    def get_page(self, url, contextFactory=None, *args, **kwargs):
        """Download a webpage as a string.
 
        This method relies on the twisted.web.client.getPage but adds an extra
        step. If there is an auth error the method will perform a second try
        so that the username and password are used.
        """
        scheme, _, _, _ = client._parse(url)
        factory = ProxyClientFactory(self.proxy_url, self.proxy_port, url)
        if scheme == 'https':
            from twisted.internet import ssl
            if contextFactory is None:
                contextFactory = ssl.ClientContextFactory()
            # pylint: disable=E1101
            reactor.connectSSL(self.proxy_url, self.proxy_port,
                               factory, contextFactory)
            # pylint: enable=E1101
        else:
            # pylint: disable=E1101
            reactor.connectTCP(self.proxy_url, self.proxy_port, factory)
            # pylint: enable=E1101
        factory.deferred.addErrback(self._process_auth_error, url,
                                    contextFactory)
        return factory.deferred
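
As a rough sketch, the retry logic above boils down to the following. It is synchronous and leaves Twisted out on purpose, so it is not the code above, just the same idea in miniature. It assumes Python 3 (where b64encode needs bytes), and the names basic_proxy_auth_header, get_with_proxy_auth, send and fake_proxy are made up for the illustration.

```python
import base64


def basic_proxy_auth_header(username, password):
    """Return the value for a Proxy-Authorization header (basic auth)."""
    credentials = ('%s:%s' % (username, password)).encode('utf-8')
    # b64encode returns bytes; decode so the header value is text.
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


def get_with_proxy_auth(send, url, username, password):
    """Try without credentials first; retry once if the proxy answers 407.

    `send` is a hypothetical callable: send(url, headers) -> (status, body).
    """
    status, body = send(url, {})
    if status == 407:  # Proxy Authentication Required
        headers = {'Proxy-Authorization':
                   basic_proxy_auth_header(username, password)}
        status, body = send(url, headers)
    return status, body


def fake_proxy(url, headers):
    """Stand-in for a proxy that only accepts user 'user' / password 'pass'."""
    if headers.get('Proxy-Authorization') == 'Basic dXNlcjpwYXNz':
        return 200, 'page for %s' % url
    return 407, ''


print(basic_proxy_auth_header('user', 'pass'))
print(get_with_proxy_auth(fake_proxy, 'http://example.com/', 'user', 'pass'))
```

The first request goes out without credentials, and only a 407 from the proxy triggers the second one with the header, which is the same order of events as in get_page and _process_auth_error.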

I hope that this helps anyone out there :)

From Canonical

  • claudio

    Hi,
    First, thank you for sharing. I tried this code but it does not work.
    The resulting request puts the hostname of the proxy in the Host field, so the server answers with a bad request.
    I am trying to get:
    http://devimages.apple.com/iphone/samples/bipbop/bipbopall.m3u8

    and the generated request is:

    GET /iphone/samples/bipbop/bipbopall.m3u8 HTTP/1.0
    host: http://myproxy.com
    connection: close
    user-agent: Twisted PageGetter

    where http://myproxy.com is my proxy (without authentication). There is also this connection: close header, which I do not understand.

    The correct request (when I do not use a proxy) would have been:

    GET /iphone/samples/bipbop/bipbopall.m3u8 HTTP/1.0
    Host: devimages.apple.com
    User-Agent: Twisted PageGetter

    Can you help me with that, please? Thank you.

  • http://mandel.themacaque.com mandel

    Hello,

    Can you please post the code so that I can take a look? The fact that the host is the proxy is correct because the GET has to go through the proxy; the page will later be fetched from the real host, or maybe from the cache (if, for example, you are using a caching proxy).

    The request is initially sent without auth because that is how it should be sent until the proxy returns PROXY_AUTH_REQUIRED (407), at which point we retry with the correct auth credentials.

    Again, please post the code here and I’ll take a closer look.