To generate a sitemap.xml file, you need a crawler that collects every valid link on the site. Here is a complete solution:

Step 1: Install the required Python libraries

```
pip install requests beautifulsoup4 lxml
```

Step 2: Create the crawler script (sitemap_generator.py)
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
from datetime import datetime


def get_all_links(base_url):
    # Track visited URLs and the queue of URLs still to crawl
    visited = set()
    queue = [base_url]
    all_links = set()

    while queue:
        url = queue.pop(0)
        if url in visited:
            continue

        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) "
                              "Chrome/91.0.4472.124 Safari/537.36"
            }
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                continue

            # Record the page as visited and keep it for the sitemap
            visited.add(url)
            all_links.add(url)
            print(f"Crawled: {url}")

            # Parse the HTML and collect new links
            soup = BeautifulSoup(response.text, "lxml")
            for link in soup.find_all("a", href=True):
                href = link["href"].strip()
                full_url = urljoin(url, href)

                # Filter out invalid links
                parsed = urlparse(full_url)
                if parsed.scheme not in ("http", "https"):
                    continue
                if not parsed.netloc.endswith("91kaiye.cn"):  # same-site links only
                    continue
                if "#" in full_url:  # strip fragment anchors
                    full_url = full_url.split("#")[0]

                # Enqueue links we have not visited yet
                if full_url not in visited:
                    queue.append(full_url)
        except Exception as e:
            print(f"Error crawling {url}: {str(e)}")

    return all_links


def create_sitemap(links, filename="sitemap.xml"):
    root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for link in sorted(links):
        url_elem = ET.SubElement(root, "url")
        ET.SubElement(url_elem, "loc").text = link
        ET.SubElement(url_elem, "lastmod").text = datetime.now().strftime("%Y-%m-%d")
        ET.SubElement(url_elem, "changefreq").text = "daily"
        ET.SubElement(url_elem, "priority").text = "0.8"
    tree = ET.ElementTree(root)
    tree.write(filename, encoding="utf-8", xml_declaration=True)
    print(f"\nSitemap generated: {filename} with {len(links)} URLs")


if __name__ == "__main__":
    base_url = "https://www.91kaiye.cn/"
    print("Starting crawl...")
    links = get_all_links(base_url)
    create_sitemap(links)
```
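One optional refinement, assuming Python 3.9 or newer: ElementTree writes the whole document on a single line, which is valid XML but hard to inspect by eye. Calling ET.indent() on the tree before tree.write() pretty-prints the output. A minimal standalone sketch:

```python
import xml.etree.ElementTree as ET

# Build a one-URL sitemap just to demonstrate the pretty-printing step
root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
url_elem = ET.SubElement(root, "url")
ET.SubElement(url_elem, "loc").text = "https://www.91kaiye.cn/"

tree = ET.ElementTree(root)
ET.indent(tree, space="  ")  # available since Python 3.9
tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

In create_sitemap, the same ET.indent(tree, space="  ") call would go just before tree.write(filename, ...).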
Step 3: Run the script

```
python sitemap_generator.py
```

How it works

Crawl logic:
- Performs a breadth-first search starting from the homepage https://www.91kaiye.cn/
- Automatically filters out off-site links, anchors, and invalid URLs
- Stamps every URL's lastmod with the current date (the script does not detect a page's real modification time)
- Sets changefreq to daily and priority to 0.8

Output file

The generated sitemap.xml looks like this:

```xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.91kaiye.cn/page1/</loc>
    <lastmod>2023-10-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>
```

Notes

Anti-crawling measures: if the site has anti-bot protection, you may need to (a throttled-request sketch follows this section):
- Add a time.sleep(1) delay between requests
- Use proxy IPs
- Send more realistic request headers

Dynamic content: for JavaScript-rendered pages (e.g. Vue/React), switch to Selenium or Playwright; a minimal Playwright sketch also follows below.

Optimization tips:
- Run the script on the server on a schedule (e.g. once a week)
- Submit the sitemap to Google Search Console
- Add this line to robots.txt:

```
Sitemap: https://www.91kaiye.cn/sitemap.xml
```
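To make the anti-crawling advice concrete, here is a minimal sketch of a throttled request helper. polite_get is a hypothetical name introduced here for illustration; it is not part of the script above. It adds a fixed delay and a simple retry around requests.get:

```python
import time
import requests


def polite_get(url, headers, delay=1.0, retries=2):
    # Hypothetical helper: pause before every request and retry a couple
    # of times, which makes the crawler gentler on rate-limited sites.
    for _ in range(retries + 1):
        time.sleep(delay)  # fixed throttle before each attempt
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            pass  # network error: fall through to the next attempt
    return None
```

Inside get_all_links you would swap requests.get(url, headers=headers, timeout=10) for polite_get(url, headers) and skip the URL when it returns None. Proxies, if needed, can be passed through to requests.get via its proxies parameter.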
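For the dynamic-content case, a minimal sketch of fetching fully rendered HTML with Playwright. It assumes pip install playwright followed by playwright install chromium, and the function name fetch_rendered_html is introduced here, not taken from the original script:

```python
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url):
    # Load the page in headless Chromium and wait for network activity
    # to settle, so JavaScript-injected links are present in the DOM.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
```

The returned HTML can be fed to the same BeautifulSoup parsing step used in get_all_links, in place of response.text.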
Alternative: use an online tool

If you would rather not run code, you can generate the sitemap with an online service:
- XML-Sitemaps.com
- Screaming Frog SEO Spider (a desktop tool)

After generating the file, upload sitemap.xml to the site's root directory and submit it through the Baidu/Google webmaster tools.