23 Python Crawler Projects You Must Know

WechatSogou [1] - WeChat official account crawler. A crawler interface for WeChat official accounts based on Sogou WeChat search; it can be extended into a general crawler built on Sogou search. Results are returned as a list, where each item is a dictionary of one official account's information. GitHub address: https://github.com/Chyroc/WechatSogou
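For a feel of the interface, here is a minimal usage sketch, assuming the project is installed as the wechatsogou package and exposes a WechatSogouAPI class with a search_gzh() method; verify the exact names against the repo's README:

    import wechatsogou

    # Assumption: WechatSogouAPI and search_gzh() as described in the repo's docs;
    # search_gzh() should return a list of dicts, one per official account.
    ws_api = wechatsogou.WechatSogouAPI()
    for account in ws_api.search_gzh('python'):
        print(account.get('wechat_name'), account.get('profile_url'))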

DouBanSpider [2] - Douban Reading crawler. Crawls all books under a Douban Reading tag, sorts them by rating, and stores them in Excel, making it easy to filter and collect, for example high-scoring books with more than 1,000 ratings; books can also be stored in different Excel sheets by topic. It uses a User-Agent header to masquerade as a browser and adds random delays to better imitate browser behavior and avoid being blocked. GitHub address: https://github.com/lanbing510/DouBanSpider
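The two anti-blocking tricks mentioned here, a browser User-Agent header plus random delays, look roughly like this minimal requests sketch (the URL and header value are placeholders):

    import random
    import time

    import requests

    HEADERS = {
        # Present a desktop-browser identity instead of the default python-requests one.
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }

    def fetch(url):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        # Sleep 1-3 seconds between requests to imitate human browsing.
        time.sleep(random.uniform(1, 3))
        return resp.text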

zhihu_spider [3] - Zhihu crawler. This project crawls Zhihu user information and the follow-relationship topology between users. The crawler is built on the scrapy framework, with data stored in MongoDB. GitHub address: https://github.com/LiuRoy/zhihu_spider
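In practice, the scrapy-plus-MongoDB combination usually means an item pipeline that writes each scraped item into a collection; a minimal sketch, with the database and collection names invented for illustration:

    import pymongo

    class MongoPipeline:
        """Scrapy item pipeline that stores every scraped item in MongoDB."""

        def open_spider(self, spider):
            self.client = pymongo.MongoClient('mongodb://localhost:27017')
            self.db = self.client['zhihu']  # hypothetical database name

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db['users'].insert_one(dict(item))  # hypothetical collection name
            return item

Such a pipeline would be enabled through ITEM_PIPELINES in the scrapy project settings.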

bilibili-user [4] - Bilibili user crawler. Total records: 20,119,918. Captured fields: user id, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, etc. After crawling, it generates a user data report for Bilibili. GitHub address: https://github.com/airingursb/bilibili-user

SinaSpider [5] - Sina Weibo crawler. Mainly crawls Sina Weibo users' personal information, Weibo posts, followers, and followees. The code logs in by obtaining Sina Weibo cookies, and rotating logins across multiple accounts helps avoid Sina's anti-scraping measures. Built mainly on the scrapy crawler framework. GitHub address: https://github.com/LiuXingMing/SinaSpider
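Rotating logged-in sessions across several accounts can be sketched with requests; the cookie names and values below are placeholders, since real ones come out of Weibo's login flow:

    import itertools

    import requests

    # Placeholder cookies for several logged-in accounts.
    ACCOUNT_COOKIES = [
        {'SUB': 'cookie-for-account-1'},
        {'SUB': 'cookie-for-account-2'},
    ]

    sessions = []
    for cookies in ACCOUNT_COOKIES:
        s = requests.Session()
        s.cookies.update(cookies)
        sessions.append(s)

    # Cycle through accounts so no single one trips the rate limits.
    session_cycle = itertools.cycle(sessions)

    def fetch(url):
        return next(session_cycle).get(url, timeout=10)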

distribute_crawler [6] - Distributed novel-download crawler. A distributed web crawler implemented with scrapy, Redis, MongoDB, and graphite: a MongoDB cluster for underlying storage, Redis for distribution, and graphite for displaying crawler status, targeting a single novel site. GitHub address: https://github.com/gnemoug/distribute_crawler
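The Redis part of such a setup is essentially a shared URL queue that many worker processes pop from; a minimal sketch with redis-py (the queue key is arbitrary):

    import redis

    r = redis.Redis(host='localhost', port=6379)

    QUEUE = 'crawler:urls'  # hypothetical queue key shared by all workers

    def enqueue(url):
        r.lpush(QUEUE, url)

    def worker():
        while True:
            # BRPOP blocks until a URL is available, so idle workers simply wait.
            _, url = r.brpop(QUEUE)
            process(url.decode())

    def process(url):
        print('crawling', url)  # placeholder for the real page handler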

CnkiSpider [7] - CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to capture data; results are stored in the /data directory, and the first line of each data file contains the field names. GitHub address: https://github.com/yanzhou/CnkiSpider

LianJiaSpider [8] - Lianjia crawler. Crawls historical second-hand housing transaction records of Lianjia in Beijing. Covers all the code from the Lianjia crawler article, including the Lianjia simulated-login code. GitHub address: https://github.com/lanbing510/LianJiaSpider

scrapy_jingdong [9] - JD.com crawler. A JD.com crawler based on scrapy; output is saved as CSV. GitHub address: https://github.com/taizilongxu/scrapy_jingdong

QQ-Groups-Spider [10] - QQ group crawler. Grabs QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates XLS(X)/CSV result files. GitHub address: https://github.com/caspartse/QQ-Groups-Spider
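Writing the CSV variant of that result file is just Python's standard csv module; the field names below follow the description above and the row is a made-up sample:

    import csv

    groups = [
        {'name': 'Example Group', 'number': '12345678', 'members': 200,
         'owner': '10001', 'intro': 'sample row for illustration'},
    ]

    # utf-8-sig keeps Chinese text readable when the file is opened in Excel.
    with open('qq_groups.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'number', 'members', 'owner', 'intro'])
        writer.writeheader()
        writer.writerows(groups)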

wooyun_public [11] - Wooyun crawler. A crawler and search tool for Wooyun's public vulnerabilities and knowledge base. All public vulnerability listings and the text of each vulnerability are stored in MongoDB, about 2 GB in total; crawling the whole site's text and images for offline browsing takes roughly 10 GB of space and 2 hours (on a 10 Mb telecom connection); crawling the full knowledge base takes about 500 MB. Vulnerability search uses Flask as the web server and Bootstrap for the front end. GitHub address: https://github.com/hanc00l/wooyun_public
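A minimal sketch of the Flask-plus-MongoDB search side described here; the route, database, and field names are assumptions, not the project's actual schema:

    from flask import Flask, jsonify, request
    import pymongo

    app = Flask(__name__)
    db = pymongo.MongoClient('mongodb://localhost:27017')['wooyun']  # hypothetical db name

    @app.route('/search')
    def search():
        keyword = request.args.get('q', '')
        # Hypothetical 'title' field; a real deployment would use a text index.
        docs = db['bugs'].find({'title': {'$regex': keyword}}, {'_id': 0}).limit(20)
        return jsonify(list(docs))

    if __name__ == '__main__':
        app.run()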

spider [12] - hao123 website crawler. Starting from hao123 as the entry page, it iteratively crawls outbound links, collects URLs, and records each URL's internal and external link counts, title, and other information (a link-classification sketch follows the next entry). Tested on 32-bit Windows 7; it currently collects around 100,000 URLs every 24 hours. GitHub address: https://github.com/simapple/spider

findtrip [13] - Air ticket crawler (Qunar and Ctrip). Findtrip is a ticket crawler based on Scrapy that currently integrates the two major domestic ticket sites (Qunar + Ctrip). GitHub address: https://github.com/fankcoder/findtrip
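For the hao123-style spider [12] above, the internal-versus-external link bookkeeping is a small exercise with BeautifulSoup and urllib.parse; a minimal sketch:

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def classify_links(page_url):
        """Return (internal, external) link lists for one page."""
        host = urlparse(page_url).netloc
        soup = BeautifulSoup(requests.get(page_url, timeout=10).text, 'html.parser')
        internal, external = [], []
        for a in soup.find_all('a', href=True):
            link = urljoin(page_url, a['href'])
            (internal if urlparse(link).netloc == host else external).append(link)
        return internal, external

    internal, external = classify_links('https://www.hao123.com')
    print(len(internal), 'internal,', len(external), 'external')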

163spider [14] - NetEase client content crawler based on requests, MySQLdb, and torndb. GitHub address: https://github.com/leyle/163spider

doubanspiders [15] - A set of Douban crawlers for movies, books, groups, albums, things, etc. GitHub address: https://github.com/fanpei91/doubanspiders

QQSpider [16] - QQ Zone crawler, covering journals, shuoshuo posts, personal information, etc.; it can crawl 4 million records a day. GitHub address: https://github.com/LiuXingMing/QQSpider

baidu-music-spider [17] - Baidu MP3 site crawler that uses Redis for breakpoint resume, so an interrupted crawl can pick up where it left off. GitHub address: https://github.com/Shu-Ji/baidu-music-spider
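Breakpoint resume with Redis usually means persisting the set of already-fetched items, so a restarted crawler skips work it has done; a minimal sketch with an invented key name:

    import redis

    r = redis.Redis()

    DONE = 'crawler:done'  # hypothetical Redis set of finished item ids

    def crawl(item_id):
        # SADD returns 0 if the id is already in the set, i.e. already crawled.
        if not r.sadd(DONE, item_id):
            return  # finished in a previous run; skip
        download(item_id)

    def download(item_id):
        print('downloading', item_id)  # placeholder for the real download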

tbcrawler [18] - Taobao and Tmall crawler. Crawls page information by search keyword or item id; data is stored in MongoDB. GitHub address: https://github.com/pakoo/tbcrawler

stockholm [19] - A stock data crawler (Shanghai and Shenzhen) and stock-selection strategy testing framework. Captures market data for all Shanghai and Shenzhen stocks over a chosen date range. Supports defining stock-selection strategies with expressions and multi-threaded processing; data can be saved to JSON or CSV files. GitHub address: https://github.com/benitoro/stockholm
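The multi-threaded fetch plus JSON output described here maps naturally onto concurrent.futures; the stock codes and data source below are placeholders:

    import json
    from concurrent.futures import ThreadPoolExecutor

    def fetch_quote(code):
        # Placeholder: a real implementation would call a market-data API here.
        return {'code': code, 'close': 10.0}

    codes = ['600000', '000001', '300750']  # sample Shanghai/Shenzhen codes

    with ThreadPoolExecutor(max_workers=8) as pool:
        quotes = list(pool.map(fetch_quote, codes))

    with open('quotes.json', 'w', encoding='utf-8') as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)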

BaiduyunSpider [20] - Baidu Yun (cloud disk) crawler. GitHub address: https://github.com/k1995/BaiduyunSpider

Spider [21] - Social data crawler with support for Weibo, Zhihu, and Douban. GitHub address: https://github.com/Qutan/Spider

proxy_pool [22] - Proxy IP pool for Python crawlers. GitHub address: https://github.com/jhao104/proxy_pool
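Using the pool from a crawler typically means asking the pool service for an address and handing it to requests; the endpoint below is an assumption, so check proxy_pool's README for the real API:

    import requests

    # Assumption: the pool runs locally and serves a proxy address over HTTP;
    # the exact route and response format depend on the proxy_pool version.
    proxy = requests.get('http://127.0.0.1:5010/get/').json().get('proxy')

    resp = requests.get(
        'https://httpbin.org/ip',
        proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
        timeout=10,
    )
    print(resp.json())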

music-163 [23] - Crawls the comments on all songs of NetEase Cloud Music. GitHub address: https://github.com/RitterHou/music-163

Reprinted: http://www.sohu.com/a/166385794_804770

Reference: https://cloud.tencent.com/developer/article/1390473