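Java's built-in URLConnection can handle basic tasks such as connecting, downloading, and closing, but it offers little help with more advanced needs such as disguising requests as a browser or cleaning up HTML markup. So this time the author uses:
HttpClient 4.5.2: simulating HTTP requests;
Chrome (F12 developer tools): analyzing the structure of the target page;
Jsoup 1.9.2: parsing HTML;
to re-crawl the question entries on the Zhihu Explore (发现) page.
In the screenshot, the red area marks the question entries, i.e. the items that end with a question mark.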

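Basic idea

The basic idea of the "知乎脚丫" crawler is: the main thread simulates a browser GET request to the seed URL -> collects the links of all questions on that page -> starts crawler threads that send GET requests to the collected leaf URLs -> parses each page, extracts the target data, and saves it to storage (a database or a file).
Of course, this assumes a friendly environment: the crawl needs no simulated login, and the target site is kind enough not to do any anti-crawling. If the crawler really does get blocked, it can switch accounts or IPs dynamically, add a delay between requests, and so on.
The program therefore leaves placeholders for these two parts (simulated login and anti-crawling countermeasures), to be filled in once the author has picked up the necessary skills.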
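
The request delay mentioned above can be sketched as follows. This is only an illustration, not part of the original project: the class name PoliteDelay, its method names, and the base/jitter parameters are all hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: waits a base interval plus random jitter between requests.
final class PoliteDelay {
    private PoliteDelay() {}

    // Returns a delay in [baseMillis, baseMillis + jitterMillis]; jitterMillis must be >= 0.
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    // Sleeps the current crawler thread for one randomized interval.
    static void pause(long baseMillis, long jitterMillis) throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }
}
```

A crawler thread could call PoliteDelay.pause(1000, 500) before each GET so that requests are not sent at a fixed, easily detected rhythm.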

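Program architecture

Simulated login: not implemented yet; it is fairly complex and will be added later;
Queue initialization: holds all leaf URLs waiting to be crawled;
Fetcher: the crawler simulates a browser, sends a GET request for the URL, and downloads the page;
Handler: does a first check on the page downloaded by the Fetcher, such as whether the status code is valid and whether the content is an anti-crawling notice, so that only correct pages reach the Parser;
Parser: parses the content of the page downloaded by the Fetcher, extracting leaf links and target data;
Store: saves the target data produced by the Parser to local storage, either a traditional database such as MySQL or a file.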
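
How the modules hand work to each other can be sketched as a single loop. This is a simplified, single-threaded illustration under assumptions: the class CrawlPipelineSketch is hypothetical, pages are plain strings rather than FetchedPage objects, and each module is passed in as a function so the sketch needs no external library.

```java
import java.util.Deque;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical wiring of Fetcher -> Handler -> Parser -> Store over a URL queue.
final class CrawlPipelineSketch {
    private CrawlPipelineSketch() {}

    // Processes every queued URL and returns how many pages were stored.
    static int run(Deque<String> queue,
                   Function<String, String> fetcher,   // url -> page content, or null on failure
                   Predicate<String> handler,          // true if the page is fit for parsing
                   Function<String, String> parser,    // page content -> target data
                   Consumer<String> store) {           // persists the target data
        int stored = 0;
        while (!queue.isEmpty()) {
            String url = queue.pollFirst();
            String page = fetcher.apply(url);
            if (page == null || !handler.test(page)) {
                continue; // failed download or rejected page: skip it
            }
            store.accept(parser.apply(page));
            stored++;
        }
        return stored;
    }
}
```

In the real program each crawler thread runs a loop of this shape, with PageFetcher, ContentHandler, ContentParser, and DataStorage in place of the function arguments.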

Selected code

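First we need two queue-like data structures, one for the URLs waiting to be crawled and one for the URLs already crawled. This part is easy, so the code is not shown.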
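
For readers who want a starting point, here is a minimal sketch of what those two structures might look like. The method names addElement and addFirstElement match the calls used in the listings of this article; outElement, isEmpty, and contains are hypothetical names, and the project's actual implementation may differ.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch of the queue of URLs waiting to be crawled; synchronized for use by several threads.
final class UrlQueue {
    private static final Deque<String> QUEUE = new ArrayDeque<>();

    private UrlQueue() {}

    // Appends a newly discovered URL to the tail.
    static synchronized void addElement(String url) {
        QUEUE.addLast(url);
    }

    // Puts a failed URL back at the head so that it is retried first.
    static synchronized void addFirstElement(String url) {
        QUEUE.addFirst(url);
    }

    // Removes and returns the next URL, or null if the queue is empty (hypothetical name).
    static synchronized String outElement() {
        return QUEUE.pollFirst();
    }

    static synchronized boolean isEmpty() {
        return QUEUE.isEmpty();
    }
}

// Sketch of the collection of URLs that have already been crawled.
final class VisitedUrlQueue {
    private static final Set<String> VISITED = Collections.synchronizedSet(new HashSet<String>());

    private VisitedUrlQueue() {}

    static void addElement(String url) {
        VISITED.add(url);
    }

    // Hypothetical lookup used to avoid crawling the same URL twice.
    static boolean contains(String url) {
        return VISITED.contains(url);
    }
}
```

A set is used for the visited URLs because the only operations needed there are "add" and "have we seen this", both of which a HashSet answers in constant time.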

Fetcher module

Simulating a browser's HTTP GET request

import java.io.IOException;
import java.util.logging.Logger;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpResponseException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import comSpider.model.FetchedPage;
import comSpider.queue.UrlQueue;

public class PageFetcher {
	private static final Logger Log = Logger.getLogger(PageFetcher.class.getName());
	
	/**
	 * Shared HttpClient instance, created with the default connection settings
	 */
	private static CloseableHttpClient client=HttpClients.createDefault();
	/**
	 * Explicitly close the HttpClient and release its connections
	 */
	public static void close(){
		try {
			client.close();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
	
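	/**
	 * Fetch the content of the page at the given url
	 * @param url address of the page to fetch
	 * @return FetchedPage, or null if the request failed
	 */
	public static FetchedPage getContentFromUrl(String url){
		FetchedPage fetchedPage=null;
		
		// Create the GET request and set a browser-like User-Agent header (example value)
		HttpGet getHttp = new HttpGet(url);
		getHttp.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
		System.out.println(getHttp.getRequestLine());
		
		// Create the ResponseHandler; handleResponse(HttpResponse response) must be overridden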
		ResponseHandler<FetchedPage> responseHandler = new ResponseHandler<FetchedPage>(){

			@Override
			public FetchedPage handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
				StatusLine statusLine=response.getStatusLine();// status line of the response
				int statusCode=statusLine.getStatusCode();// status code
				HttpEntity entity=response.getEntity();// entity (body) of the response
				// Status code >= 300: throw a response exception
				if(statusCode>=300){
					throw new HttpResponseException(
		                    statusCode,
		                    statusLine.getReasonPhrase());
				}
				// No entity: throw an exception
		        if (entity == null) {
		            throw new ClientProtocolException("Response contains no content");
		        }
		        // Convert to text; UTF-8 is used when the response declares no charset, to avoid garbled text
		        String content = EntityUtils.toString(entity, "UTF-8");
				return new FetchedPage(url,content,statusCode);
			}
		};
		
		// Execute the request and handle the response
		try{
			fetchedPage=client.execute(getHttp, responseHandler);
		}
		catch(Exception e){
			e.printStackTrace();
			
			// On failures such as request timeouts, put the URL back into the pending queue to crawl it again
			Log.info(">> Put back url: " + url);
			UrlQueue.addFirstElement(url);
		}
		return fetchedPage;
	}
	
}
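
One caveat with the fetcher above: every exception puts the URL back into the queue, including the HttpResponseException thrown for a 404 that will never succeed, so a permanently failing URL could be retried without end. A possible remedy is a per-URL retry budget. The sketch below is an assumption, not part of the original project; the class RetryTracker and its method shouldRetry are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that limits how often a failing URL is put back into the queue.
final class RetryTracker {
    private final int maxRetries;
    private final Map<String, Integer> failures = new HashMap<>();

    RetryTracker(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Records one failure for the URL; returns true while the URL may still be retried.
    synchronized boolean shouldRetry(String url) {
        int count = failures.merge(url, 1, Integer::sum);
        return count <= maxRetries;
    }
}
```

In the catch block, the call to UrlQueue.addFirstElement(url) would then be guarded by a check such as tracker.shouldRetry(url), so that a URL is dropped once its budget is used up.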

What exactly does the HttpResponse returned by the request look like? The author printed it out:

HttpResponseProxy{HTTP/1.1 200 OK [Server: Qnginx/1.1.2, Date: Thu, 21 Jul 2016 12:11:54 GMT, Content-Type: text/html; charset=UTF-8, Connection: keep-alive, Last-Modified: Thu, 21 Jul 2016 12:03:51 GMT, X-Za-Response-Id: dc09fd95f941415c8bea6a277914da97, Content-Security-Policy: default-src *; img-src * data:; frame-src 'self' *.zhihu.com getpocket.com note.youdao.com; script-src 'self' *.zhihu.com *.google-analytics.com zhstatic.zhihu.com res.wx.qq.com 'unsafe-eval'; style-src 'self' *.zhihu.com 'unsafe-inline', Pragma: no-cache, X-Frame-Options: DENY, X-Cache-Status: HIT, Accept-Ranges: bytes, X-Req-ID: 29223CA15790BC0D, Vary: Accept-Encoding, X-NWS-LOG-UUID: 44868d3b-2c6d-47e9-aaaa-59f0a09f90f7] org.apache.http.client.entity.DecompressingEntity@dbd940d}

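A few common fields are worth knowing: HTTP/1.1 is the protocol version; 200 is the returned status code; OK is the reason phrase, meaning the request succeeded; [Server: ...] is the list of response headers sent by the server; org.apache.http.client.entity.DecompressingEntity@dbd940d is the entity that holds the actual page content, which still has to be converted to text.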

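Handler module

This module deals with anti-crawling responses and failed connections. It is a first pass for now.
A summary of HTTP status codes:

Typical errors include 404 (page not found), 403 (forbidden), and 401 (authentication required).
An HTTP status code indicates the status of the response returned over HTTP.
For example, when a client sends a request and the server returns the requested resource successfully, the status code is 200. If the requested resource does not exist, the server usually returns 404.
Status codes are three-digit integers, grouped into five classes by their first digit, 1 to 5:
200: request succeeded. Handling: read the response content and process it.
201: request completed and a new resource was created; its URI is given in the response. Handling: not encountered when crawling.
202: request accepted, but processing is not yet complete. Handling: block and wait.
204: the server fulfilled the request but has no new information to return; a user agent need not update its document view. Handling: discard.
300: not used directly by HTTP/1.0 applications, only as the default interpretation of 3XX responses; several candidate resources are available. Handling: process further if the program can, otherwise discard.
301: the resource has been assigned a permanent new URL, which should be used for future access. Handling: redirect to the assigned URL.
302: the resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
304: the requested resource has not been modified. Handling: discard.
400: bad request. Handling: discard.
401: unauthorized. Handling: discard.
403: forbidden. Handling: discard.
404: not found. Handling: discard.
5XX: codes starting with 5 mean the server hit an error and cannot complete the request. Handling: discard.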
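
The handling rules in this list can be expressed as a small lookup. The sketch below is illustrative only: the class StatusPolicy and the Action enum are hypothetical names, and 300 is mapped to DISCARD on the assumption that the program has no special handling for it.

```java
// Hypothetical mapping from an HTTP status code to the handling listed above.
final class StatusPolicy {
    enum Action { PROCESS, WAIT, REDIRECT, DISCARD }

    private StatusPolicy() {}

    static Action actionFor(int statusCode) {
        switch (statusCode) {
            case 200:
                return Action.PROCESS;   // read the content and process it
            case 202:
                return Action.WAIT;      // accepted but not finished: block and wait
            case 301:
            case 302:
                return Action.REDIRECT;  // follow the permanent or temporary URL
            default:
                return Action.DISCARD;   // 201, 204, 300, 304, 4XX, 5XX
        }
    }
}
```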

import comSpider.model.FetchedPage;
import comSpider.queue.UrlQueue;

public class ContentHandler {
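	public boolean check(FetchedPage fetchedPage){
		// The fetcher returns null when the request failed, so there is nothing to check
		if(fetchedPage == null){
			return false;
		}
		// If the page contains anti-crawling content, put the URL back into the pending queue to crawl it again
		if(isAntiScratch(fetchedPage)){
			UrlQueue.addFirstElement(fetchedPage.getUrl());
			return false;
		}
		
		// Only pages with a valid status code are passed on to the parser
		return isStatusValid(fetchedPage.getStatusCode());
	}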
	
	// Decide from the status code whether the response is valid
	private boolean isStatusValid(int statusCode){
		if(statusCode >= 200 && statusCode < 400){
			return true;
		}
		return false;
	}
	// Whether the page is an anti-crawling response
	private boolean isAntiScratch(FetchedPage fetchedPage){
		// 403 forbidden
		if(fetchedPage.getStatusCode() == 403){
			return true;
		}
		
		// Anti-crawling marker in the page body ("禁止访问" means "access denied")
		String content = fetchedPage.getContent();
		if(content != null && content.contains("<div>禁止访问</div>")){
			return true;
		}
		
		return false;
	}
}

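Parser module

The parser has two functions: parseLinks() parses the seed page to collect all leaf links, and parse() parses a leaf page to extract the target data.
Note: as mentioned earlier, the link to each question must be an absolute URL.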

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import comSpider.model.FetchedPage;
import comSpider.queue.UrlQueue;
import comSpider.queue.VisitedUrlQueue;
public class ContentParser {
	/**
	 * Parse a question page and wrap the target data in a ZhihuQuestions object
	 * @param questionPage the fetched question page
	 * @return the extracted question data
	 */
	public synchronized ZhihuQuestions parse(FetchedPage questionPage){
		// Extract the target data behind this question link into a ZhihuQuestions object
		ZhihuQuestions targetZhihu=new ZhihuQuestions(questionPage.getUrl());
		// Record the crawled URL in the visited queue
		VisitedUrlQueue.addElement(questionPage.getUrl());
		
		// Derive the next URLs to crawl from the current page and URL
		// TODO
		
		return targetZhihu;
	}
	/**
	 * Extract the links of all questions on the Zhihu Explore page and add them to the pending queue UrlQueue
	 * @param fetchedPage the fetched seed page
	 */
	public void parseLinks(FetchedPage fetchedPage){
		// Parse the HTML into a clean document
		Document doc = Jsoup.parse(fetchedPage.getContent());
		// All <a> tags with class question_link; their href is the relative address of a question
		Elements links = doc.select("a.question_link");
		for(Element link:links){
			String linkHref = link.attr("href");// relative path of the link
			String nextLink="https://www.zhihu.com"+linkHref;// absolute path
			UrlQueue.addElement(nextLink);
		}
	}
}
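
parseLinks() builds the absolute address by prepending the host to the href, which works as long as every href starts with "/". A slightly more general way, shown here only as a sketch, is to resolve the href against the address of the page it came from using the standard java.net.URI class. The class LinkResolver and its method name are hypothetical.

```java
import java.net.URI;

// Hypothetical helper that turns an href from a page into an absolute URL.
final class LinkResolver {
    private LinkResolver() {}

    // Resolves href against the URL of the page that contained it.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }
}
```

For example, LinkResolver.toAbsolute("https://www.zhihu.com/explore", "/question/12345") yields "https://www.zhihu.com/question/12345", and an href that is already absolute is returned unchanged.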

Store module

Writes the crawled target data to a file.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import comSpider.parser.ZhihuQuestions;

public class DataStorage {
	private File dir;
	private File file;
	static volatile boolean isCreatedSucceed=false;
	FileWriter fw;
	public DataStorage(){
		createDirFile();
		try {
			fw=new FileWriter(file,true);
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
	
	/**
	 * Append one question record to the local file out/questions.txt
	 * @param data the question to store
	 */
	public synchronized void store(ZhihuQuestions data){
		// store to DB
		// TODO
		
		// store to File
		try{
			// All threads share the single FileWriter opened in the constructor (append mode, second argument true);
			// if each thread opened its own FileWriter on the same file, their output would interleave
			BufferedWriter writer=new BufferedWriter(fw);
			writer.write(data.toString());
			writer.flush();
		}catch(Exception e){
			e.printStackTrace();
		}
	}
	/**
	 * Create the file out/questions.txt if it does not exist yet
	 */
	private void createDirFile(){
		try{
			// Always set the paths, so that every DataStorage instance has a valid file to open
			dir=new File("out");
			file=new File(dir,"questions.txt");
			// mkdirs() and createNewFile() do nothing when the target already exists
			dir.mkdirs();
			if(file.createNewFile()){
				isCreatedSucceed=true;// the file has been created
				System.out.println("create \"out/questions.txt\" file succeed." );
			}
		}catch(Exception e){
			e.printStackTrace();
		}
	}
}
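
As an alternative to the FileWriter approach above, the same append-only storage can be written with java.nio.file, which creates the directory and the file in one place and lets try-with-resources close the writer. This is a sketch under assumptions, not the project's code: the class LineAppender is hypothetical and stores plain strings instead of ZhihuQuestions objects.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only writer shared by all crawler threads.
final class LineAppender implements AutoCloseable {
    private final BufferedWriter writer;

    // Opens the file in append mode, creating parent directories and the file if needed.
    LineAppender(Path file) throws IOException {
        Path parent = file.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Writes one record on its own line; synchronized so that records never interleave.
    synchronized void append(String record) throws IOException {
        writer.write(record);
        writer.newLine();
        writer.flush();
    }

    @Override
    public synchronized void close() throws IOException {
        writer.close();
    }
}
```

A single instance, for example new LineAppender(Paths.get("out", "questions.txt")), would be created once and passed to every crawler thread, then closed when crawling finishes.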

Results

The whole project has been published on GitHub.