📄 libcurl-tutorial.html

📁 功能最强大的网络爬虫,希望大家好好学习啊,好好研究啊
💻 HTML
📖 第 1 页 / 共 5 页
字号:
 simply output the received data to stdout. You can have the default callback write the data to a different file handle by passing a 'FILE *' to a file opened for writing with the <span Class="emphasis">CURLOPT_WRITEDATA</span> option. <p class="level0">Now, we need to take a step back and have a deep breath. Here's one of those rare platform-dependent nitpicks. Did you spot it? On some platforms[2], libcurl won't be able to operate on files opened by the program. Thus, if you use the default callback and pass in an open file with <span Class="emphasis">CURLOPT_WRITEDATA</span>, it will crash. You should therefore avoid this to make your program run fine virtually everywhere. <p class="level0">(<span Class="emphasis">CURLOPT_WRITEDATA</span> was formerly known as <span Class="emphasis">CURLOPT_FILE</span>. Both names still work and do the same thing). <p class="level0">If you're using libcurl as a win32 DLL, you MUST use the <span Class="emphasis">CURLOPT_WRITEFUNCTION</span> if you set <span Class="emphasis">CURLOPT_WRITEDATA</span> - or you will experience crashes. <p class="level0">There are of course many more options you can set, and we'll get back to a few of them later. Let's instead continue to the actual transfer: <p class="level0">&nbsp;success = curl_easy_perform(easyhandle); <p class="level0"><a class="emphasis" href="./curl_easy_perform.html">curl_easy_perform(3)</a> will connect to the remote site, do the necessary commands and receive the transfer. Whenever it receives data, it calls the callback function we previously set. The function may get one byte at a time, or it may get many kilobytes at once. libcurl delivers as much as possible as often as possible. Your callback function should return the number of bytes it "took care of". If that is not the exact same amount of bytes that was passed to it, libcurl will abort the operation and return with an error code. <p class="level0">When the transfer is complete, the function returns a return code that informs you if it succeeded in its mission or not. If a return code isn't enough for you, you can use the CURLOPT_ERRORBUFFER to point libcurl to a buffer of yours where it'll store a human readable error message as well. <p class="level0">If you then want to transfer another file, the handle is ready to be used again. Mind you, it is even preferred that you re-use an existing handle if you intend to make another transfer. libcurl will then attempt to re-use the previous connection. <p class="level0"></pre><a name="Multi-threading"></a><h2 class="nroffsh">Multi-threading Issues</h2><p class="level0">The first basic rule is that you must <span Class="bold">never</span> share a libcurl handle (be it easy or multi or whatever) between multiple threads. Only use one handle in one thread at a time. <p class="level0">libcurl is completely thread safe, except for two issues: signals and SSL/TLS handlers. Signals are used timeouting name resolves (during DNS lookup) - when built without c-ares support and not on Windows.. <p class="level0">If you are accessing HTTPS or FTPS URLs in a multi-threaded manner, you are then of course using OpenSSL/GnuTLS multi-threaded and those libs have their own requirements on this issue. Basically, you need to provide one or two functions to allow it to function properly. For all details, see this: <p class="level0">OpenSSL <p class="level0">&nbsp;  <a href="http://www.openssl.org/docs/crypto/threads.html">http://www.openssl.org/docs/crypto/threads.html</a>#DESCRIPTION <p class="level0">GnuTLS <p class="level0">&nbsp;<a href="http://www.gnu.org/software/gnutls/manual/html_node/">http://www.gnu.org/software/gnutls/manual/html_node/</a>Multi_002dthreaded-applications.html <p class="level0">When using multiple threads you should set the CURLOPT_NOSIGNAL option to TRUE for all handles. Everything will or might work fine except that timeouts are not honored during the DNS lookup - which you can work around by building libcurl with c-ares support. c-ares is a library that provides asynchronous name resolves. Unfortunately, c-ares does not yet fully support IPv6. On some platforms, libcurl simply will not function properly multi-threaded unless this option is set. <p class="level0">Also, note that CURLOPT_DNS_USE_GLOBAL_CACHE is not thread-safe. <p class="level0"><a name="When"></a><h2 class="nroffsh">When It Doesn't Work</h2><p class="level0">There will always be times when the transfer fails for some reason. You might have set the wrong libcurl option or misunderstood what the libcurl option actually does, or the remote server might return non-standard replies that confuse the library which then confuses your program. <p class="level0">There's one golden rule when these things occur: set the CURLOPT_VERBOSE option to TRUE. It'll cause the library to spew out the entire protocol details it sends, some internal info and some received protocol data as well (especially when using FTP). If you're using HTTP, adding the headers in the received output to study is also a clever way to get a better understanding why the server behaves the way it does. Include headers in the normal body output with CURLOPT_HEADER set TRUE. <p class="level0">Of course there are bugs left. We need to get to know about them to be able to fix them, so we're quite dependent on your bug reports! When you do report suspected bugs in libcurl, please include as much details you possibly can: a protocol dump that CURLOPT_VERBOSE produces, library version, as much as possible of your code that uses libcurl, operating system name and version, compiler name and version etc. <p class="level0">If CURLOPT_VERBOSE is not enough, you increase the level of debug data your application receive by using the CURLOPT_DEBUGFUNCTION. <p class="level0">Getting some in-depth knowledge about the protocols involved is never wrong, and if you're trying to do funny things, you might very well understand libcurl and how to use it better if you study the appropriate RFC documents at least briefly. <p class="level0"><a name="Upload"></a><h2 class="nroffsh">Upload Data to a Remote Site</h2><p class="level0">libcurl tries to keep a protocol independent approach to most transfers, thus uploading to a remote FTP site is very similar to uploading data to a HTTP server with a PUT request. <p class="level0">Of course, first you either create an easy handle or you re-use one existing one. Then you set the URL to operate on just like before. This is the remote URL, that we now will upload. <p class="level0">Since we write an application, we most likely want libcurl to get the upload data by asking us for it. To make it do that, we set the read callback and the custom pointer libcurl will pass to our read callback. The read callback should have a prototype similar to: <p class="level0">&nbsp;size_t function(char *bufptr, size_t size, size_t nitems, void *userp); <p class="level0">Where bufptr is the pointer to a buffer we fill in with data to upload and size*nitems is the size of the buffer and therefore also the maximum amount of data we can return to libcurl in this call. The 'userp' pointer is the custom pointer we set to point to a struct of ours to pass private data between the application and the callback. <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_READFUNCTION, read_function); <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_INFILE, &filedata); <p class="level0">Tell libcurl that we want to upload: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_UPLOAD, TRUE); <p class="level0">A few protocols won't behave properly when uploads are done without any prior knowledge of the expected file size. So, set the upload file size using the CURLOPT_INFILESIZE_LARGE for all known file sizes like this[1]: <p class="level0"><pre><p class="level0">&nbsp;/* in this example, file_size must be an off_t variable */ &nbsp;curl_easy_setopt(easyhandle, CURLOPT_INFILESIZE_LARGE, file_size); <p class="level0">When you call <a class="emphasis" href="./curl_easy_perform.html">curl_easy_perform(3)</a> this time, it'll perform all the necessary operations and when it has invoked the upload it'll call your supplied callback to get the data to upload. The program should return as much data as possible in every invoke, as that is likely to make the upload perform as fast as possible. The callback should return the number of bytes it wrote in the buffer. Returning 0 will signal the end of the upload. <p class="level0"></pre><a name="Passwords"></a><h2 class="nroffsh">Passwords</h2><p class="level0">Many protocols use or even require that user name and password are provided to be able to download or upload the data of your choice. libcurl offers several ways to specify them. <p class="level0">Most protocols support that you specify the name and password in the URL itself. libcurl will detect this and use them accordingly. This is written like this: <p class="level0">&nbsp;protocol://user:password@example.com/path/ <p class="level0">If you need any odd letters in your user name or password, you should enter them URL encoded, as %XX where XX is a two-digit hexadecimal number. <p class="level0">libcurl also provides options to set various passwords. The user name and password as shown embedded in the URL can instead get set with the CURLOPT_USERPWD option. The argument passed to libcurl should be a char * to a string in the format "user:password:". In a manner like this: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_USERPWD, "myname:thesecret"); <p class="level0">Another case where name and password might be needed at times, is for those users who need to authenticate themselves to a proxy they use. libcurl offers another option for this, the CURLOPT_PROXYUSERPWD. It is used quite similar to the CURLOPT_USERPWD option like this: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_PROXYUSERPWD, "myname:thesecret"); <p class="level0">There's a long time unix "standard" way of storing ftp user names and passwords, namely in the $HOME/.netrc file. The file should be made private so that only the user may read it (see also the "Security Considerations" chapter), as it might contain the password in plain text. libcurl has the ability to use this file to figure out what set of user name and password to use for a particular host. As an extension to the normal functionality, libcurl also supports this file for non-FTP protocols such as HTTP. To make curl use this file, use the CURLOPT_NETRC option: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_NETRC, TRUE); <p class="level0">And a very basic example of how such a .netrc file may look like: <p class="level0"><pre><p class="level0">&nbsp;machine myhost.mydomain.com &nbsp;login userlogin &nbsp;password secretword <p class="level0">All these examples have been cases where the password has been optional, or at least you could leave it out and have libcurl attempt to do its job without it. There are times when the password isn't optional, like when you're using an SSL private key for secure transfers. <p class="level0">To pass the known private key password to libcurl: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_SSLKEYPASSWD, "keypassword"); <p class="level0"></pre><a name="HTTP"></a><h2 class="nroffsh">HTTP Authentication</h2><p class="level0">The previous chapter showed how to set user name and password for getting URLs that require authentication. When using the HTTP protocol, there are many different ways a client can provide those credentials to the server and you can control what way libcurl will (attempt to) use. The default HTTP authentication method is called 'Basic', which is sending the name and password in clear-text in the HTTP request, base64-encoded. This is insecure. <p class="level0">At the time of this writing libcurl can be built to use: Basic, Digest, NTLM, Negotiate, GSS-Negotiate and SPNEGO. You can tell libcurl which one to use with CURLOPT_HTTPAUTH as in: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_HTTPAUTH, CURLAUTH_DIGEST); <p class="level0">And when you send authentication to a proxy, you can also set authentication type the same way but instead with CURLOPT_PROXYAUTH: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_PROXYAUTH, CURLAUTH_NTLM); <p class="level0">Both these options allow you to set multiple types (by ORing them together), to make libcurl pick the most secure one out of the types the server/proxy claims to support. This method does however add a round-trip since libcurl must first ask the server what it supports: <p class="level0">&nbsp;curl_easy_setopt(easyhandle, CURLOPT_HTTPAUTH, &nbsp;CURLAUTH_DIGEST|CURLAUTH_BASIC); <p class="level0">For convenience, you can use the 'CURLAUTH_ANY' define (instead of a list with specific types) which allows libcurl to use whatever method it wants. <p class="level0">When asking for multiple types, libcurl will pick the available one it considers "best" in its own internal order of preference. <p class="level0"><a name="HTTP"></a><h2 class="nroffsh">HTTP POSTing</h2><p class="level0">We get many questions regarding how to issue HTTP POSTs with libcurl the proper way. This chapter will thus include examples using both different versions of HTTP POST that libcurl supports. <p class="level0">The first version is the simple POST, the most common version, that most HTML pages using the &lt;form&gt; tag uses. We provide a pointer to the data and tell libcurl to post it all to the remote site: <p class="level0"><pre><p class="level0">&nbsp;   char *data="name=daniel&project=curl"; &nbsp;   curl_easy_setopt(easyhandle, CURLOPT_POSTFIELDS, data); &nbsp;   curl_easy_setopt(easyhandle, CURLOPT_URL, "<a href="http://posthere.com/">http://posthere.com/</a>"); <p class="level0">&nbsp;   curl_easy_perform(easyhandle); /* post away! */ <p class="level0">Simple enough, huh? Since you set the POST options with the CURLOPT_POSTFIELDS, this automatically switches the handle to use POST in the upcoming request.
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -