Online:  http://curl.haxx.se/docs/httpscripting.shtml
Author:  Daniel Stenberg <daniel@haxx.se>
Date:    October 31, 2001
Version: 0.5

                The Art Of Scripting HTTP Requests Using Curl
                =============================================

 This document will assume that you're familiar with HTML and general
 networking.

 The possibility to write scripts is essential to make a good computer
 system. Unix' capability to be extended by shell scripts and various tools to
 run various automated commands and scripts is one reason why it has succeeded
 so well.

 The increasing number of applications moving to the web has made "HTTP
 Scripting" more frequently requested and wanted. To be able to automatically
 extract information from the web, to fake users, to post or upload data to
 web servers are all important tasks today.

 Curl is a command line tool for doing all sorts of URL manipulations and
 transfers, but this particular document will focus on how to use it when
 doing HTTP requests for fun and profit. I'll assume that you know how to
 invoke 'curl --help' or 'curl --manual' to get basic information about it.

 Curl is not written to do everything for you. It makes the requests, it gets
 the data, it sends data and it retrieves the information. You probably need
 to glue everything together using some kind of script language or repeated
 manual invocations.

1. The HTTP Protocol

 HTTP is the protocol used to fetch data from web servers. It is a very simple
 protocol that is built upon TCP/IP. The protocol also allows information to
 get sent to the server from the client using a few different methods, as will
 be shown here.

 HTTP is plain ASCII text lines being sent by the client to a server to
 request a particular action, and then the server replies with a few text
 lines before the actual requested content is sent to the client.

 Using curl's option -v will display what kind of commands curl sends to the
 server, as well as a few other informational texts.
 -v is the single most useful option when it comes to debugging or even
 understanding the curl<->server interaction.

2. URL

 The Uniform Resource Locator format is how you specify the address of a
 particular resource on the Internet. You know these, you've seen URLs like
 http://curl.haxx.se or https://yourbank.com a million times.

3. GET a page

 The simplest and most common request/operation made using HTTP is to get a
 URL. The URL could itself refer to a web page, an image or a file. The client
 issues a GET request to the server and receives the document it asked for.
 If you issue the command line

        curl http://curl.haxx.se

 you get a web page returned in your terminal window: the entire HTML document
 that the URL holds.

 All HTTP replies contain a set of headers that are normally hidden; use
 curl's -i option to display them as well as the rest of the document. You can
 also ask the remote server for ONLY the headers by using the -I option.

4. Forms

 Forms are the general way a web site can present an HTML page with fields for
 the user to enter data in, and then press some kind of 'OK' or 'submit'
 button to get that data sent to the server. The server then typically uses
 the posted data to decide how to act: using the entered words to search in a
 database, adding the info to a bug tracking system, displaying the entered
 address on a map, or using the info as a login prompt verifying that the user
 is allowed to see what they are about to see.

 Of course there has to be some kind of program on the server end to receive
 the data you send. You cannot just invent something out of the air.

 4.1 GET

  A GET-form uses the method GET, as specified in HTML like:

        <form method="GET" action="junk.cgi">
          <input type=text name="birthyear">
          <input type=submit name=press value="OK">
        </form>

  In your favorite browser, this form will appear with a text box to fill in
  and a press-button labeled "OK".
  If you fill in '1905' and press the OK button, your browser will then
  create a new URL to get for you. The URL will get
  "junk.cgi?birthyear=1905&press=OK" appended to the path part of the
  previous URL.

  If the original form was seen on the page "www.hotmail.com/when/birth.html",
  the second page you'll get will become
  "www.hotmail.com/when/junk.cgi?birthyear=1905&press=OK".

  Most search engines work this way.

  To make curl do the GET form post for you, just enter the expected created
  URL:

        curl "www.hotmail.com/when/junk.cgi?birthyear=1905&press=OK"

 4.2 POST

  The GET method makes all input field names get displayed in the URL field of
  your browser. That's generally a good thing when you want to be able to
  bookmark that page with your given data, but it is an obvious disadvantage
  if you entered secret information in one of the fields or if there is a
  large number of fields creating a very long and unreadable URL.

  The HTTP protocol then offers the POST method. This way the client sends the
  data separated from the URL and thus you won't see any of it in the URL
  address field.

  The form would look very similar to the previous one:

        <form method="POST" action="junk.cgi">
          <input type=text name="birthyear">
          <input type=submit name=press value="OK">
        </form>

  And to use curl to post this form with the same data filled in as before, we
  could do it like:

        curl -d "birthyear=1905&press=OK" www.hotmail.com/when/junk.cgi

  This kind of POST will use the Content-Type
  application/x-www-form-urlencoded and is the most widely used POST kind.

 4.3 FILE UPLOAD POST

  Back in late 1995 they defined a new way to post data over HTTP. It was
  documented in RFC 1867, which is why this method is sometimes referred to as
  RFC1867-posting.

  This method is mainly designed to better support file uploads.
  A form that allows a user to upload a file could be written like this in
  HTML:

    <form method="POST" enctype='multipart/form-data' action="upload.cgi">
      <input type=file name=upload>
      <input type=submit name=press value="OK">
    </form>

  This clearly shows that the Content-Type about to be sent is
  multipart/form-data.

  To post to a form like this with curl, you enter a command line like:

        curl -F upload=@localfilename -F press=OK [URL]

 4.4 HIDDEN FIELDS

  A very common way for HTML based applications to pass state information
  between pages is to add hidden fields to the forms. Hidden fields are
  already filled in, they aren't displayed to the user and they get passed
  along just as all the other fields.

  A similar example form with one visible field, one hidden field and one
  submit button could look like:

    <form method="POST" action="foobar.cgi">
      <input type=text name="birthyear">
      <input type=hidden name="person" value="daniel">
      <input type=submit name="press" value="OK">
    </form>

  To post this with curl, you won't have to think about whether the fields are
  hidden or not. To curl they're all the same:

        curl -d "birthyear=1905&press=OK&person=daniel" [URL]

 4.5 FIGURE OUT WHAT A POST LOOKS LIKE

  When you're about to fill in a form and send it to a server by using curl
  instead of a browser, you're of course very interested in sending a POST
  exactly the way your browser does.

  An easy way to get to see this is to save the HTML page with the form on
  your local disk, modify the 'method' to a GET, and press the submit button
  (you could also change the action URL if you want to).

  You will then clearly see the data get appended to the URL, separated with a
  '?' character as GET forms are supposed to.
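
  As a minimal sketch of that relationship, the script below prints the curl
  command lines for the GET and the POST variant of the same form data. The
  get_url helper is a hypothetical name introduced here for illustration, and
  the host and script name are only the example values from section 4.1, not
  a service expected to answer. The commands are printed rather than run, so
  the sketch needs no network access:

```shell
#!/bin/sh
# Sketch: one set of form data, expressed as a GET form and as a POST form.
# 'action' and 'data' reuse the example values from sections 4.1 and 4.2.

action="www.hotmail.com/when/junk.cgi"
data="birthyear=1905&press=OK"

# get_url (hypothetical helper): join the action URL and the form data
# with a '?', which is what a browser does when it submits a GET form.
get_url() {
    printf '%s?%s\n' "$1" "$2"
}

# GET form: the data travels in the URL itself.
echo curl "\"$(get_url "$action" "$data")\""

# POST form: the same data is passed with -d and sent in the request body.
echo curl -d "\"$data\"" "$action"
```

  The string seen after the '?' in the browser's URL field for the GET
  variant is the same string you would hand to -d for the POST variant.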
