A crawler written in PHP that grabs every article and image from a blog

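I have been learning JavaScript and Python lately. One day a friend of mine, a graduate student, was talking with me about Python crawlers. I asked him to teach me how to build GUI programs in Python, and since I know web login mechanisms fairly well and he wanted to crawl a site that requires login, we traded notes. Last night that gave me the idea of writing a small crawler in PHP. OK, I admit it: I just want to do SEO and build a network of spam sites…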

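Actually, the cracked software shared on this site is really great. A lot of the software I use was downloaded from there, so I genuinely think it is a good site. That is why I decided to crawl it, then use crontab to re-crawl it every day and send a bit of traffic to my feehi.com. OK, I admit this is a tiny bit unethical, but I am doing it in the spirit of research and learning…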

[Image: PHP crawler]

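Sadly, PHP has no multithreading, so it takes a very, very long time to crawl a site of roughly 3,000 articles like this one (including downloading the images and writing everything to the database). There is no way around it: PHP is what I know best, I am not familiar with even the basic Python APIs, and I would not get multithreading working any time soon, so my first crawler is in PHP. To practice my regular expressions (which I rarely use; for efficiency I normally rely on built-in functions, since what I match is rarely complex enough to need a regex), this time I did not use an extension library to find DOM nodes. Instead I analyzed the HTML source by hand and wrote the regexes myself. Luckily this site does not use AJAX, otherwise it would have been more trouble.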
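For comparison, the DOM route that the crawler deliberately skips could look like the sketch below. It uses PHP's bundled DOMDocument and DOMXPath classes. The sample markup, the example.com URLs and the function name extractArticleLinks are made up for illustration, and the real list page may be structured differently.

```php
<?php
// Hypothetical list-page markup: article links sit inside <h2> headings.
$html = <<<HTML
<html><body>
<h2>
  <a href="http://example.com/post/1">First post</a>
</h2>
<h2><a href="http://example.com/post/2">Second post</a></h2>
<p><a href="http://example.com/about">About</a></p>
</body></html>
HTML;

// Return the href of every <a> that is a direct child of an <h2>.
function extractArticleLinks($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//h2/a/@href') as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

print_r(extractArticleLinks($html));
```

The XPath expression plays the role of the hand-written pattern: whitespace and line breaks between the tags no longer matter, at the cost of depending on the dom extension.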

[Image: crawler]

[Image: log]

Here is the code:

<?php
error_reporting(0);
set_time_limit(0); // remove the execution time limit
date_default_timezone_set('PRC');
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASSWORD', 'xxx');
$logTxt = date('Y-m-d H-i-s') . '.txt';

// Print a message and append it, with a timestamp, to the log file.
function logMsg($msg)
{
    global $logTxt;
    $msg .= "\r\n";
    echo $msg;
    file_put_contents($logTxt, date('Y-m-d H:i:s') . " $msg", FILE_APPEND);
}

// Download $url into a directory below this script and return the site-relative path.
function saveImage($url, $dir)
{
    $data = file_get_contents($url);
    $info = pathinfo($url);
    $fullDir = dirname(__FILE__) . $dir;
    if (!is_dir($fullDir)) {
        mkdir($fullDir, 0777, true);
    }
    // A random prefix keeps files with the same name from overwriting each other.
    $name = rand(0, 10000) . '_' . urlencode($info['basename']);
    if ($data !== false) {
        file_put_contents($fullDir . $name, $data);
    }
    return $dir . $name;
}

logMsg('Crawl started');
logMsg('Connecting to the database...');
// mysqli replaces the old mysql_* functions, which no longer exist in PHP 7+.
mysqli_report(MYSQLI_REPORT_OFF); // report failures through return values
$conn = mysqli_connect(DB_HOST, DB_USER, DB_PASSWORD, 'soft');
if ($conn) {
    logMsg('Database connection succeeded');
} else {
    logMsg('Database connection failed, exiting');
    exit();
}
mysqli_set_charset($conn, 'utf8');

for ($num = 1; $num <= 154; $num++) {
    logMsg("Analyzing page {$num}");
    $url = 'http://www.guofs.com/page/' . $num;
    $content = file_get_contents($url);
    if ($content === false) {
        logMsg("Failed to fetch page {$num}, skipping");
        continue;
    }
    // Article links on the list page
    preg_match_all('/<h2>\s*<a href="(.*)"/U', $content, $matches);
    logMsg('Found ' . count($matches[1]) . " articles on page {$num}");
    // Thumbnails on the list page, assumed to appear in the same order as the articles
    preg_match_all('/id="customImg"\s+class="customImg"><a\s+href=".*"><img\s+src="(.*)"/U', $content, $matchesThumb);
    $thumbPic = array();
    foreach ($matchesThumb[1] as $thumbK => $thumbV) {
        logMsg('Downloading thumbnail ' . ($thumbK + 1) . " of page {$num}");
        $thumbPic[] = saveImage($thumbV, '/thumb/' . date('Y-m-d') . '/');
        logMsg('Thumbnail ' . ($thumbK + 1) . " of page {$num} downloaded");
    }
    foreach ($matches[1] as $k => $v) {
        logMsg('Analyzing article ' . ($k + 1) . " of page {$num}...");
        $urlEsc = mysqli_real_escape_string($conn, $v);
        // Count one more check for this URL (only affects a row that already exists).
        mysqli_query($conn, "update ttt set checked_times=checked_times+1 where url='$urlEsc'");
        $result = mysqli_query($conn, "select * from ttt where url='$urlEsc'");
        $row = $result ? mysqli_fetch_assoc($result) : null;
        if (is_array($row)) {
            logMsg("{$v} was already crawled at " . date('Y-m-d H:i:s', $row['created_at']) . ', skipping it this time.');
            continue;
        }
        $content2 = file_get_contents($v);
        if ($content2 === false) {
            logMsg("Failed to fetch {$v}, skipping");
            continue;
        }
        preg_match('/class="postTitle">\s*<h1>(.*)<\/h1>/U', $content2, $matches2);
        $title = isset($matches2[1]) ? $matches2[1] : '';
        logMsg("Fetched {$v}. Title: {$title}...");
        // [\S\s] is used instead of . because the article body spans many lines.
        preg_match('/class="entry">([\S\s]+)<div/U', $content2, $matches2);
        $article = isset($matches2[1]) ? $matches2[1] : '';
        preg_match_all('/<img[\s\S]*src="(.*)"/U', $article, $pics);
        foreach ($pics[1] as $k2 => $v2) {
            logMsg('This article contains ' . count($pics[1]) . ' images, downloading image ' . ($k2 + 1) . '...');
            $localPath = saveImage($v2, '/uploads/' . date('Y-m-d') . '/');
            logMsg('Image ' . ($k2 + 1) . ' downloaded...');
            // Point the article at the local copy of the image.
            $article = str_replace($v2, $localPath, $article);
        }
        // Swap the source site's domain and name for mine.
        $article = str_replace(array('www.guofs.com', '独木成林'), array('soft.feehi.com', '飞嗨'), $article);
        $time = time(); // set for every article, including those without images
        $thumb = isset($thumbPic[$k]) ? $thumbPic[$k] : '';
        // Escape every string value so quotes in the HTML cannot break the statement.
        mysqli_query($conn, sprintf(
            "insert into ttt(title,content,thumb,created_at,url) values('%s','%s','%s',%d,'%s')",
            mysqli_real_escape_string($conn, $title),
            mysqli_real_escape_string($conn, $article),
            mysqli_real_escape_string($conn, $thumb),
            $time,
            $urlEsc
        ));
        logMsg("{$v} saved to the database...");
    }
}
mysqli_close($conn);
logMsg("Crawl finished; see {$logTxt} in the working directory for the log");
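?>

First, look at the URLs. They are very traditional: the page number is passed right in the URL as page (here /page/N), which makes things easy. There are 154 pages in total, so a single for loop covers them. Then I parse each list page for its article URLs and crawl them one by one, with loops nested inside loops... Nothing fancy, since there is no AJAX anyway. The fiddly part is writing the regular expressions. I finally ran into the trap that catches a lot of people: by default .* does not match across line breaks, because . does not match a newline. Oh my god. Lesson learned, and I quickly switched to [\s\S]*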
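The newline trap can be reproduced in a few lines. This is a minimal sketch on made-up markup (the $html string is hypothetical, not taken from the crawled site). It compares the plain dot, the [\s\S] workaround used in the crawler, and the s (PCRE_DOTALL) pattern modifier, which is another common way to get the same effect.

```php
<?php
// Hypothetical article markup whose body is spread over several lines.
$html = "<div class=\"entry\">\nline one\nline two\n<div class=\"footer\">";

// '.' does not match "\n" by default, so this pattern finds nothing.
$dotOnly = preg_match('/class="entry">(.*)<div/U', $html, $m1);

// [\s\S] matches any character, newlines included.
$anyChar = preg_match('/class="entry">([\s\S]*)<div/U', $html, $m2);

// The s modifier makes '.' match newlines as well.
$dotAll = preg_match('/class="entry">(.*)<div/Us', $html, $m3);

var_dump($dotOnly);            // int(0)
var_dump($m2[1] === $m3[1]);   // bool(true)
```

Either fix works; the s modifier keeps the pattern shorter, while [\s\S] changes only the one spot that has to cross line breaks.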

[Image: downloaded images]

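After that, log and print whatever is needed and save the results to the database. Of course, I replace his site name and domain with mine everywhere... When I have time I will find a template, and then the SEO site soft.feehi.com will go live.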

Please credit the source when reposting: 飞嗨 » A crawler written in PHP that grabs every article and image from a blog
