libcurl-the-guide

 libcurl has full support for HTTP proxies, so when a given URL is wanted,
 libcurl will ask the proxy for it instead of trying to connect to the actual
 host identified in the URL.

 The fact that the proxy is an HTTP proxy puts certain restrictions on what
 can actually happen. A requested URL that might not be an HTTP URL will
 still be passed to the HTTP proxy to deliver back to libcurl. This happens
 transparently, and an application may not need to know. I say "may", because
 at times it is very important to understand that all operations over an HTTP
 proxy use the HTTP protocol. For example, you can't invoke your own custom
 FTP commands or even get proper FTP directory listings.

 Proxy Options

    To tell libcurl to use a proxy at a given port number:

       curl_easy_setopt(easyhandle, CURLOPT_PROXY, "proxy-host.com:8080");

    Some proxies require user authentication before allowing a request, and
    you pass that information similar to this:

       curl_easy_setopt(easyhandle, CURLOPT_PROXYUSERPWD, "user:password");

    If you want to, you can specify the host name only in the CURLOPT_PROXY
    option, and set the port number separately with CURLOPT_PROXYPORT.

 Environment Variables

    libcurl automatically checks and uses a set of environment variables to
    know what proxies to use for certain protocols. The names of the variables
    follow an ancient de facto standard and are built up as
    "[protocol]_proxy" (note the lower casing). This makes the variable
    'http_proxy' checked for the name of a proxy to use when the input URL is
    HTTP. Following the same rule, the variable named 'ftp_proxy' is checked
    for FTP URLs. Again, the proxies are always HTTP proxies; the different
    names of the variables simply allow different HTTP proxies to be used.

    The proxy environment variable contents should be in the format
    "[protocol://]machine[:port]".
    The protocol:// part is simply ignored if present (so http://proxy and
    bluerk://proxy will do the same) and the optional port number specifies
    on which port the proxy operates on the host. If not specified, the
    internal default port number will be used and that is most likely *not*
    the one you would like it to be.

    There are two special environment variables. 'all_proxy' sets the proxy
    for any URL in case the protocol-specific variable wasn't set, and
    'no_proxy' defines a list of hosts that should not use a proxy even though
    a variable may say so. If 'no_proxy' is a plain asterisk ("*") it matches
    all hosts.

 SSL and Proxies

    SSL is for secure point-to-point connections. This involves strong
    encryption and similar things, which effectively makes it impossible for a
    proxy to operate as a "man in between", which is the proxy's task as
    previously discussed. Instead, the only way to have SSL work over an HTTP
    proxy is to ask the proxy to tunnel everything through without being able
    to check or fiddle with the traffic.

    Opening an SSL connection over an HTTP proxy is therefore a matter of
    asking the proxy for a straight connection to the target host on a
    specified port. This is done with the HTTP request CONNECT ("please Mr.
    Proxy, connect me to that remote host").

    Because of the nature of this operation, where the proxy has no idea what
    kind of data is passed in and out through this tunnel, this breaks some of
    the very few advantages that come from using a proxy, such as caching.
    Many organizations prevent this kind of tunneling to destination port
    numbers other than 443 (which is the default HTTPS port number).

 Tunneling Through Proxy

    As explained above, tunneling is required for SSL to work and is often
    even restricted to the operation intended for SSL: HTTPS.

    This is however not the only time proxy-tunneling might offer benefits to
    you or your application.
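
    As a rough illustration of the CONNECT request mentioned under "SSL and
    Proxies", the sketch below formats the request text a client would send
    to the proxy. build_connect_request() is a hypothetical helper written for
    this example, not a libcurl function, and it assumes the host is a plain
    name or IPv4 address (an IPv6 literal would need brackets). libcurl builds
    this request itself when tunneling is in use, so an application normally
    never has to.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (NOT part of libcurl): format the request a client
 * sends to an HTTP proxy to ask for a tunnel to host:port.
 *
 * Assumption: host is a plain host name or IPv4 address.
 *
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the arguments are invalid or the buffer is too small.
 */
int build_connect_request(char *buf, size_t size,
                          const char *host, unsigned int port)
{
    int n;

    assert(buf != NULL && host != NULL);

    /* reject empty hosts, impossible ports and embedded whitespace/CRLF */
    if (host[0] == '\0' || port == 0 || port > 65535)
        return -1;
    if (strpbrk(host, " \t\r\n") != NULL)
        return -1;

    n = snprintf(buf, size,
                 "CONNECT %s:%u HTTP/1.1\r\n"
                 "Host: %s:%u\r\n"
                 "\r\n",
                 host, port, host, port);

    /* snprintf reports the length it wanted; treat truncation as failure */
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

    Called with "example.com" and 443, the buffer holds a request line
    "CONNECT example.com:443 HTTP/1.1", a matching Host: header and an empty
    line. Once the proxy answers with a 2xx response, the bytes that follow
    belong to the tunneled protocol and the proxy merely relays them.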

    As tunneling opens a direct connection from your application to the remote
    machine, it suddenly also re-introduces the ability to do non-HTTP
    operations over an HTTP proxy. You can in fact use things such as FTP
    upload or FTP custom commands this way.

    Again, this is often prevented by the administrators of proxies and is
    rarely allowed.

    Tell libcurl to use proxy tunneling like this:

       curl_easy_setopt(easyhandle, CURLOPT_HTTPPROXYTUNNEL, 1L);

    In fact, there might even be times when you want to do plain HTTP
    operations using a tunnel like this, as it then enables you to operate on
    the remote server instead of asking the proxy to do so. libcurl will not
    stand in the way of such innovative actions either!

 Proxy Auto-Config

    Netscape first came up with this. It is basically a web page (usually
    using a .pac extension) with a javascript that, when executed by the
    browser with the requested URL as input, returns information to the
    browser on how to connect to the URL. The returned information might be
    "DIRECT" (which means no proxy should be used), "PROXY host:port" (to tell
    the browser where the proxy for this particular URL is) or "SOCKS
    host:port" (to direct the browser to a SOCKS proxy).

    libcurl has no means to interpret or evaluate javascript and thus it
    doesn't support this. If you get yourself in a position where you face
    this nasty invention, the following advice has been mentioned and used in
    the past:

    - Depending on the javascript complexity, write up a script that
      translates it to another language and execute that.
    - Read the javascript code and rewrite the same logic in another language.
    - Implement a javascript interpreter; people have successfully used the
      Mozilla javascript engine in the past.
    - Ask your admins to stop this, for a static proxy setup or similar.

Persistence Is The Way to Happiness

 Re-cycling the same easy handle several times when doing multiple requests is
 the way to go.

 After each single curl_easy_perform() operation, libcurl will keep the
 connection alive and open. A subsequent request using the same easy handle to
 the same host might just be able to use the already open connection! This
 reduces network impact a lot.

 Even if the connection is dropped, all later connections involving SSL to the
 same host will benefit from libcurl's session ID cache, which drastically
 reduces re-connection time.

 FTP connections that are kept alive save a lot of time, as the
 command-response roundtrips are skipped, and you also don't risk getting
 blocked without permission to log in again, as happens on many FTP servers
 that only allow N persons to be logged in at the same time.

 libcurl caches DNS name resolving results, to make lookups of a previously
 looked up name a lot faster.

 Other interesting details that improve performance for subsequent requests
 may also be added in the future.

 Each easy handle will attempt to keep the last few connections alive for a
 while in case they are to be used again. You can set the size of this "cache"
 with the CURLOPT_MAXCONNECTS option. Default is 5. There is very seldom any
 point in changing this value, and if you think of changing this it is often
 just a matter of thinking again.

 When the connection cache gets filled, libcurl must close an existing
 connection in order to get room for the new one. To know which connection to
 close, libcurl uses a "close policy" that you can affect with the
 CURLOPT_CLOSEPOLICY option. There are only two policies implemented as of
 this writing (libcurl 7.9.4) and they are:

 CURLCLOSEPOLICY_LEAST_RECENTLY_USED simply closes the one that hasn't been
 used for the longest time. This is the default behavior.

 CURLCLOSEPOLICY_OLDEST closes the oldest connection, the one that was
 created the longest time ago.
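
 To make the difference between the two policies concrete, here is a small
 model of how a victim could be picked from a full cache. It is only a sketch
 of the idea: struct cached_conn, enum close_policy and
 pick_connection_to_close() are hypothetical names invented for this example,
 and nothing here reflects how libcurl actually stores its connections.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one cached connection; not a libcurl type. */
struct cached_conn {
    long created;    /* tick at which the connection was opened */
    long last_used;  /* tick at which it last carried a request */
};

/* Stand-ins for the two close policies described in the text. */
enum close_policy {
    POLICY_LEAST_RECENTLY_USED,
    POLICY_OLDEST
};

/* Pick the timestamp that the given policy compares. */
static long policy_key(const struct cached_conn *c, enum close_policy policy)
{
    return (policy == POLICY_OLDEST) ? c->created : c->last_used;
}

/*
 * Return the index of the connection to close.
 * The cache must hold at least one entry; ties go to the lowest index.
 */
size_t pick_connection_to_close(const struct cached_conn *cache, size_t n,
                                enum close_policy policy)
{
    size_t victim = 0;
    size_t i;

    assert(cache != NULL && n > 0);

    for (i = 1; i < n; i++) {
        /* a smaller tick means "longer ago" under either policy */
        if (policy_key(&cache[i], policy) < policy_key(&cache[victim], policy))
            victim = i;
    }
    return victim;
}
```

 Take three connections with (created, last_used) ticks of (10, 90), (20, 30)
 and (5, 70). The least-recently-used policy picks the second one, since it
 has been idle since tick 30. The oldest policy picks the third one, since it
 was created at tick 5, even though it was used more recently than the second.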

 There are, or at least were, plans to support a close policy that would call
 a user-specified callback to let the user decide which connection to dump
 when this is necessary, which is why CURLOPT_CLOSEFUNCTION still exists as an
 option today. Nothing ever uses this though and this will not be used within
 the foreseeable future either.

 To force your upcoming request to not use an already existing connection (it
 will even close one first if there happens to be one alive to the same host
 you're about to operate on), set CURLOPT_FRESH_CONNECT to 1L. In a similar
 spirit, you can also forbid the connection used by the upcoming request from
 "lying" around and possibly getting re-used after the request by setting
 CURLOPT_FORBID_REUSE to 1L.

HTTP Headers Used by libcurl

 When you use libcurl to do HTTP requests, it'll pass along a series of
 headers automatically. It might be good for you to know and understand these.

 Host

    This header is required by HTTP 1.1 and even many 1.0 servers and should
    be the name of the server we want to talk to. This includes the port
    number if anything but default.

 Pragma

    "no-cache". Tells a possible proxy to not grab a copy from the cache but
    to fetch a fresh one.

 Accept:

    "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*". Cloned from a
    browser once a hundred years ago.

 Expect:

    When doing multi-part formposts, libcurl will set this header to
    "100-continue" to ask the server for an "OK" message before it proceeds
    with sending the data part of the post.

Customizing Operations

 There is an ongoing development today where more and more protocols are built
 upon HTTP for transport. This has obvious benefits as HTTP is a tested and
 reliable protocol that is widely deployed and has excellent proxy support.

 When you use one of these protocols, and even when doing other kinds of
 programming, you may need to change the traditional HTTP (or FTP or...)
 manners.
 You may need to change words, headers or various data. libcurl is your friend
 here too.

 CUSTOMREQUEST

    If just changing the actual HTTP request keyword is what you want, like
    when GET, HEAD or POST is not good enough for you, CURLOPT_CUSTOMREQUEST
    is there for you. It is very simple to use:

       curl_easy_setopt(easyhandle, CURLOPT_CUSTOMREQUEST, "MYOWNREQUEST");

    When using the custom request, you change the request keyword of the
    actual request you are performing. Thus, by default you make a GET request
    but you can also make a POST operation (as described before) and then
    replace the POST keyword if you want to. You're the boss.

 Modify Headers

    HTTP-like protocols pass a series of headers to the server when doing the
    request, and you're free to pass any number of extra headers that you
    think fit. Adding headers is this easy:

       struct curl_slist *headers=NULL; /* init to NULL is important */

       headers = curl_slist_append(headers, "Hey-server-hey: how are you?");
       headers = curl_slist_append(headers, "X-silly-content: yes");

       /* pass our list of custom made headers */
       curl_easy_setopt(easyhandle, CURLOPT_HTTPHEADER, headers);

       curl_easy_perform(easyhandle); /* transfer http */

       curl_slist_free_all(headers); /* free the header list */

    ... and if you think some of the internally generated headers, such as
    Accept: or Host: don't contain the data you want them to contain, you can
    replace them by simply setting them too:

       headers = curl_slist_append(headers, "Accept: Agent-007");
       headers = curl_slist_append(headers, "Host: munged.host.line");

 Delete Headers

    If you replace an existing header with one with no contents, you will
    prevent the header from being sent.
    For example, if you want to completely prevent the "Accept:" header from
    being sent, you can disable it with code similar to this:

       headers = curl_slist_append(headers, "Accept:");

    Both replacing and cancelling internal headers should be done with careful
    consideration and you should be aware that you may violate the HTTP
    protocol when doing so.

 Enforcing chunked transfer-encoding

    By making sure a request uses the custom header "Transfer-Encoding:
    chunked" when doing a non-GET HTTP operation, libcurl will switch over to