

news 2025/12/20 20:25:11

Contents

Preface
I. Scraping steps and Python library versions
    1. Python library versions
    2. Scraping steps
II. Web page analysis
    1. Analyzing the fields and URLs to collect
        1.1 Fields to scrape
        1.2 URL of each movie
        1.3 URL of each list page
    2. Locating field elements
III. Scraping code
    1. Scraping the 1905.com category information
    2. Scraping the movie detail-page HTML
    3. Parsing the HTML and saving the data to a CSV file
IV. Data cleaning and storage code

Preface

This project scrapes movie data from 1905电影网 (1905.com) to show how to collect web data with Python and related libraries. It covers each stage of the collection: the Python library versions required, the web page analysis, the data extraction, and the saving of the results. The requests library handles the network requests, BeautifulSoup parses the HTML, and the final data is saved as a CSV file for later analysis and processing.

I. Scraping steps and Python library versions

1. Python library versions

| Library | Version |
| --- | --- |
| python | 3.8.5 |
| requests | 2.31.0 |
| bs4 | 0.0.2 |
| beautifulsoup4 | 4.12.3 |
| soupsieve | 2.6 |
| lxml | 4.9.3 |
| pandas | 2.0.3 |
| sqlalchemy | 2.0.36 |
| mysql-connector-python | 9.0.0 |
| selenium | 4.15.2 |

2. Scraping steps

1. Open the China-region movie list on 1905.com.
2. Analyze the URLs of the paginated movie list.
3. Save each list page as an HTML file.
4. Parse the URL of every movie out of the saved list pages.
5. Save each movie's detail page as an HTML file.
6. Parse the required data out of the saved detail pages.
7. Save the parsed data to a CSV file.

II. Web page analysis

1. Analyzing the fields and URLs to collect

1.1 Fields to scrape

The data to scrape (outlined in red in the original screenshot) includes the movie title, genre, duration, film name, alternative titles, release date, screenwriter, director, lead actors, and plot.

[Figure: movie detail page with the fields to scrape outlined in red]

1.2 URL of each movie

The list of China-region movies is at https://www.1905.com/mdb/film/list/country-China/

The list is paginated, with several movies on each page. Clicking a movie jumps to its detail page, which holds the data to scrape, so the URL of each movie has to be parsed out of every list page.

[Figure: paginated movie list]

Inspecting the source of a single movie entry shows its URL, for example https://www.1905.com/mdb/film/2248201/, where 2248201 is the movie's ID. Each movie's URL can therefore be obtained by parsing the list page.

[Figure: page source of a single movie entry]

1.3 URL of each list page

Inspecting the page source shows the following pattern:

Page 2: https://www.1905.com/mdb/film/list/country-China/o0d0p2.html
Page 3: https://www.1905.com/mdb/film/list/country-China/o0d0p3.html
Page 4: https://www.1905.com/mdb/film/list/country-China/o0d0p4.html
Page 5: https://www.1905.com/mdb/film/list/country-China/o0d0p5.html

From this we infer that page 1 is https://www.1905.com/mdb/film/list/country-China/o0d0p1.html (the trailing o0d0p1.html can be omitted) and that page n is https://www.1905.com/mdb/film/list/country-China/o0d0p{n}.html

2. Locating field elements

Example: locating the movie title element. The CSS selector obtained for it is:

    body div.topModule.normalCommon.normal_oneLine div div div.topModule_title.clearfix div.topModule_title_left.fl h3 span

III. Scraping code

1. Scraping the 1905.com category information

```python
"""
Scrape the category information from 1905.com.

Columns: main_category (top-level category), sub_category (subcategory),
sub_category_link (subcategory link).
"""
import random
import time
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_request(url, **kwargs):
    # Random delay so that requests are not sent in rapid succession
    time.sleep(random.uniform(0.1, 2))
    print(f'Requesting: {url}')
    # Pool of User-Agent strings
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]
    # Request headers
    headers = {'User-Agent': random.choice(user_agents)}
    # Username/password authentication (private or dedicated proxy)
    username = ''
    password = ''
    proxy_url = 'http://%(user)s:%(pwd)s@%(proxy)s/' % {
        'user': username, 'pwd': password, 'proxy': '36.25.243.5:11768'}
    proxies = {'http': proxy_url, 'https': proxy_url}

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            print(f'Request failed with status code {response.status_code}, '
                  f'retrying (attempt {attempt + 1}/{max_retries})')
        except requests.exceptions.RequestException as e:
            print(f'Request raised an exception: {e}, '
                  f'retrying (attempt {attempt + 1}/{max_retries})')
        # Wait before retrying unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('All retries failed; check the errors above')
    return None  # or return the last response, depending on your needs


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def save_categories_to_csv(response, csv_file_dir='./data_csv/', csv_file_name='category.csv'):
    """
    Extract the category information from an HTML response and save it to a CSV file.

    Args:
        response (requests.Response): response object holding the HTML content.
        csv_file_dir (str): directory for the CSV file, './data_csv/' by default.
        csv_file_name (str): CSV file name, 'category.csv' by default.
    """
    # Make sure the directory exists
    csv_file_dir_path = Path(csv_file_dir)
    csv_file_dir_path.mkdir(parents=True, exist_ok=True)
    # Parse the HTML document
    soup = get_soup(response.text)
    # Extract the category information
    data_list = []
    tag_srh_group = soup.select(
        'body div.layout.mainCont.clear div.leftArea div div.col-l-bd dl.srhGroup.clear')
    for tag_srh in tag_srh_group:
        tag_dt = tag_srh.select_one('dt')
        main_category = tag_dt.text.strip() if tag_dt is not None else None
        print('Parsed data:')
        for tag_a in tag_srh.select('a'):
            sub_category = tag_a.text.strip()
            sub_category_link = 'https://www.1905.com' + tag_a.get('href', '')
            data_dict = {
                'main_category': main_category,
                'sub_category': sub_category,
                'sub_category_link': sub_category_link,
            }
            data_list.append(data_dict)
            print(data_dict)
    # Build the DataFrame and drop rows with an empty subcategory
    df = pd.DataFrame(data_list, columns=['main_category', 'sub_category', 'sub_category_link'])
    df_cleaned = df[df['sub_category'].notna() & (df['sub_category'] != '')]
    csv_file_path = csv_file_dir_path / csv_file_name
    print(f'Saving to: {csv_file_path}')
    # Save to the CSV file
    df_cleaned.to_csv(csv_file_path, index=False, encoding='utf-8-sig')


if __name__ == '__main__':
    res = get_request('https://www.1905.com/mdb/film/search/')
    if res is not None:
        save_categories_to_csv(res)
```

[Figure: contents of the saved category.csv file]

2. Scraping the movie detail-page HTML

```python
import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def get_request(url, **kwargs):
    # Random delay so that requests are not sent in rapid succession
    time.sleep(random.uniform(0.1, 2))
    print(f'Requesting: {url}')
    # Pool of User-Agent strings
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]
    # Request headers
    headers = {'User-Agent': random.choice(user_agents)}
    # Username/password authentication (private or dedicated proxy)
    username = ''
    password = ''
    proxy_url = 'http://%(user)s:%(pwd)s@%(proxy)s/' % {
        'user': username, 'pwd': password, 'proxy': '36.25.243.5:11768'}
    proxies = {'http': proxy_url, 'https': proxy_url}

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            print(f'Request failed with status code {response.status_code}, '
                  f'retrying (attempt {attempt + 1}/{max_retries})')
        except requests.exceptions.RequestException as e:
            print(f'Request raised an exception: {e}, '
                  f'retrying (attempt {attempt + 1}/{max_retries})')
        # Wait before retrying unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('All retries failed; check the errors above')
    return None  # or return the last response, depending on your needs


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def save_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Create the target directory, including any missing parents
    dir_path.mkdir(parents=True, exist_ok=True)
    # The with statement guarantees the file is closed properly
    with open(dir_path / file_name, 'w', encoding='utf-8') as fp:
        fp.write(str(content))
    print(f'{dir_path / file_name} saved')


def fetch_list_page(i):
    """Fetch list page i; return (response, <ul> tag), or (None, None) when there are no more movies."""
    url = f'https://www.1905.com/mdb/film/list/country-China/o0d0p{i}.html'
    response = get_request(url)
    if response is None:
        return None, None
    tag_ul = get_soup(response.text).select_one('body div.layout.mainCont.clear div.leftArea ul')
    # An empty (or missing) list means we are past the last page
    if tag_ul is None or tag_ul.text.strip() == '':
        return None, None
    return response, tag_ul


def save_rough_html_file():
    """Save every list page as an HTML file."""
    i = 0
    save_dir = './rough_html/china/'
    while True:
        i = i + 1
        file_name = f'o0d0p{i}.html'
        file_path = Path(save_dir + file_name)
        if file_path.exists() and file_path.is_file():
            print(f'File {file_path} already exists')
            continue
        response, _ = fetch_list_page(i)
        if response is None:
            print('Scraping finished')
            break
        save_html_file(save_dir, file_name, response.text)


def save_detail_info_html_file():
    """Save the detail page of every movie as an HTML file named after the movie ID."""
    i = 0
    save_dir = './detail_html/china/'
    while True:
        i = i + 1
        response, tag_ul = fetch_list_page(i)
        if response is None:
            print('Scraping finished')
            break
        for tag_li in tag_ul.select('li'):
            tag_a = tag_li.find('a')
            tag_a_href = tag_a.attrs.get('href') if tag_a is not None else None
            if not tag_a_href:
                continue
            movie_url = f'https://www.1905.com{tag_a_href}'
            # href looks like /mdb/film/2248201/, so the ID is the second-to-last piece
            movie_id = tag_a_href.split('/')[-2]
            file_name = f'{movie_id}.html'
            file_path = Path(save_dir + file_name)
            if file_path.exists() and file_path.is_file():
                print(f'File {file_path} already exists')
                continue
            detail_response = get_request(movie_url)
            if detail_response is None:
                continue
            save_html_file(save_dir, file_name, detail_response.text)


if __name__ == '__main__':
    # save_rough_html_file()
    save_detail_info_html_file()
```

[Figure: some of the HTML files saved by the scraper]

3. Parsing the HTML and saving the data to a CSV file

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def append_to_csv(rows, csv_file_path):
    # Append the rows; write the header only when the file is first created
    pd.DataFrame(rows).to_csv(csv_file_path, index=False, encoding='utf-8-sig',
                              mode='a', header=not csv_file_path.exists())


def parse_detail_html_to_csv():
    # CSV file path
    csv_file_dir_path = Path('../1905movie/data_csv/')
    csv_file_dir_path.mkdir(parents=True, exist_ok=True)
    csv_file_path = csv_file_dir_path / 'detail_1905movie_dataset.csv'

    detail_dir = Path('./detail_html/china/')
    movie_data_list = []
    i = 0
    for detail_file in detail_dir.rglob('*.html'):
        # The file name without its extension is the movie ID (works on any OS)
        movie_id = detail_file.stem
        movie_url = f'https://www.1905.com/mdb/film/{movie_id}/'
        soup = get_soup(detail_file.read_text(encoding='utf-8'))

        # Poster image
        tag_img_url = soup.select_one('div.topModule_bottom_poster.picHover.fl img')
        movie_img_url = tag_img_url.attrs.get('src') if tag_img_url is not None else None

        # Rating and release status ('已上映' means "released")
        tag_title_right = soup.select_one('div.topModule_title_right.fr')
        tag_evaluation_name = tag_title_right.select_one('div.evaluation-name') if tag_title_right is not None else None
        tag_judge_soon_fl = tag_title_right.select_one('div.judge-soon.fl') if tag_title_right is not None else None
        movie_rating = tag_evaluation_name.text if tag_evaluation_name is not None else None
        movie_status = tag_judge_soon_fl.text if tag_judge_soon_fl is not None else '已上映'

        # Title, genres, and duration
        tag_title_left = soup.select_one('div.topModule_title_left.fl')
        tag_h3_span = tag_title_left.select_one('h3 span') if tag_title_left is not None else None
        movie_title = tag_h3_span.text if tag_h3_span is not None else None
        tag_li = tag_title_left.select_one('li.topModule_line') if tag_title_left is not None else None
        tag_genres = tag_li.find_next_sibling('li') if tag_li is not None else None
        movie_genres = tag_genres.text.strip().split() if tag_genres is not None else None
        tag_li5 = tag_title_left.select_one('div ul li:nth-child(5)') if tag_title_left is not None else None
        movie_duration = tag_li5.text.strip() if tag_li5 is not None else None

        # Left info column: release date ('上映时间') and director
        tag_div_left_top = soup.select_one('div#left_top')
        tag_infos_l = tag_div_left_top.select_one('ul.consModule_infos.consModule_infos_l.fixedWidth.fl') if tag_div_left_top is not None else None
        tag_release_date = tag_infos_l.find(name='span', string='上映时间') if tag_infos_l is not None else None
        movie_release_date = tag_release_date.find_next_sibling().text.strip() if tag_release_date is not None else None
        tag_director = tag_infos_l.select_one('li em a') if tag_infos_l is not None else None
        movie_director = tag_director.text.strip() if tag_director is not None else None

        # Right info column: alternative titles, adaptation source ('改编来源'), screenwriter
        tag_infos_r = tag_div_left_top.select_one('ul.consModule_infos.consModule_infos_r.fl') if tag_div_left_top is not None else None
        tag_alternative_titles = tag_infos_r.select_one('li em') if tag_infos_r is not None else None
        movie_alternative_titles = tag_alternative_titles.text if tag_alternative_titles is not None else None
        tag_adaptation_source = tag_infos_r.find(name='span', string='改编来源') if tag_infos_r is not None else None
        movie_adaptation_source = tag_adaptation_source.find_next_sibling().text if tag_adaptation_source is not None else None
        tag_screenwriter = tag_infos_r.select_one('li em a') if tag_infos_r is not None else None
        movie_screenwriter = tag_screenwriter.text if tag_screenwriter is not None else None

        # Lead actors and plot
        tag_lead_actors = soup.select_one('#left_top div ul li')
        movie_lead_actors = [tag.text for tag in tag_lead_actors.select('a')] if tag_lead_actors is not None else []
        tag_plot = soup.select_one('#left_top ul li.plotItem.borderStyle div a')
        movie_plot = tag_plot.text if tag_plot is not None else None

        movie_data_dict = {
            'movie_id': movie_id,
            'movie_url': movie_url,
            'movie_img_url': movie_img_url,
            'movie_duration': movie_duration,
            'movie_title': movie_title,
            'movie_director': movie_director,
            'movie_release_date': movie_release_date,
            'movie_status': movie_status,
            'movie_rating': movie_rating,
            'movie_genres': movie_genres,
            'movie_lead_actors': movie_lead_actors,
            'movie_alternative_titles': movie_alternative_titles,
            'movie_adaptation_source': movie_adaptation_source,
            'movie_screenwriter': movie_screenwriter,
            'movie_plot': movie_plot,
        }
        i = i + 1
        print(f'Row {i}, parsed data:')
        print(movie_data_dict)
        print()
        movie_data_list.append(movie_data_dict)

        # Flush to disk every 200 movies
        if len(movie_data_list) == 200:
            append_to_csv(movie_data_list, csv_file_path)
            movie_data_list = []
            print(f'Parsed movie data saved to {csv_file_path}')

    # Flush whatever is left
    if movie_data_list:
        append_to_csv(movie_data_list, csv_file_path)
    print(f'All parsed movie data saved to {csv_file_path}')


if __name__ == '__main__':
    parse_detail_html_to_csv()
```

[Figure: contents of the saved detail_1905movie_dataset.csv file]

IV. Data cleaning and storage code

```python
import re
from datetime import datetime

import pandas as pd
from sqlalchemy import create_engine


def read_csv_to_df(file_path):
    # Load the CSV file into a DataFrame
    return pd.read_csv(file_path, encoding='utf-8')


def contains_hours(text):
    if pd.isna(text):  # NaN or None
        return False
    pattern = r'\d+\s*(小时|h|hours?|hrs?)'
    return bool(re.search(pattern, text))


def convert_to_minutes(duration_str):
    # Assumes the 'N小时M分钟' (N hours M minutes) format
    if pd.isna(duration_str):
        return None
    parts = str(duration_str).replace('小时', ' ').replace('分钟', '').split()
    hours = int(parts[0]) if len(parts) > 0 else 0
    minutes = int(parts[1]) if len(parts) > 1 else 0
    return hours * 60 + minutes


# Clean a date string and normalize it to YYYY-MM-DD
def clean_and_standardize_date(date_str):
    date_str_cleaned = str(date_str)
    # Drop the parenthesis and everything after it
    if '(' in date_str_cleaned:
        date_str_cleaned = date_str_cleaned.split('(')[0]
    date_str_cleaned = date_str_cleaned.strip()
    try:
        # Try the full date format first, then fall back to coarser ones
        if '年' in date_str_cleaned and '月' in date_str_cleaned and '日' in date_str_cleaned:
            date_obj = datetime.strptime(date_str_cleaned, '%Y年%m月%d日')
        elif '年' in date_str_cleaned and '月' in date_str_cleaned:
            date_obj = datetime.strptime(date_str_cleaned, '%Y年%m月')
            date_obj = date_obj.replace(day=1)  # first day of that month
        elif '年' in date_str_cleaned:
            date_obj = datetime.strptime(date_str_cleaned, '%Y年')
            date_obj = date_obj.replace(month=1, day=1)  # first day of that year
        else:
            return None  # no known pattern matched
    except ValueError:
        return None  # looked like a date but could not be parsed
    return date_obj.strftime('%Y-%m-%d')  # standard-format string


# Clean the data and convert the column formats
def clean_and_transform(df):
    # Keep only movies whose status is '已上映' (released); copy to avoid chained-assignment warnings
    df = df[df['movie_status'] == '已上映'].copy()
    # Drop rows with an empty movie title
    df = df.dropna(subset=['movie_title'])
    # Drop rows with a duplicate ID
    df = df.drop_duplicates(subset=['movie_id'])
    # Duration: keep values that contain hours, forward-fill the rest, convert to minutes
    df['movie_duration'] = df['movie_duration'].apply(lambda x: x if contains_hours(x) else None)
    if df['movie_duration'].isnull().sum() != 0:
        df['movie_duration'] = df['movie_duration'].ffill()
    df['movie_duration'] = df['movie_duration'].apply(convert_to_minutes)
    # Release date
    df['movie_release_date'] = df['movie_release_date'].apply(clean_and_standardize_date)
    if df['movie_release_date'].isnull().sum() != 0:
        df['movie_release_date'] = df['movie_release_date'].ffill()
    # Rating
    df['movie_rating'] = df['movie_rating'].astype(float).round(1)
    if df['movie_rating'].isnull().sum() != 0:
        df['movie_rating'] = df['movie_rating'].interpolate()
    # Genres
    if df['movie_genres'].isnull().sum() != 0:
        df['movie_genres'] = df['movie_genres'].ffill()
    # Remaining empty values are filled with '未知' (unknown)
    df = df.fillna('未知')
    return df


def save_df_to_db(df):
    # Database connection settings
    db_user = 'root'
    db_password = 'zxcvbq'
    db_host = '127.0.0.1'  # or your database host
    db_port = 3306  # MySQL's default port
    db_name = 'movie1905'
    # Create the database engine
    engine = create_engine(
        f'mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}')
    # Write the DataFrame to a MySQL table
    df.to_sql(name='movie1905_china', con=engine, if_exists='replace', index=False)
    print('The CSV data has been cleaned and written to the MySQL database')


if __name__ == '__main__':
    csv_file = r'./data_csv/detail_1905movie_dataset.csv'
    dataframe = read_csv_to_df(csv_file)
    dataframe = clean_and_transform(dataframe)
    save_df_to_db(dataframe)
```

[Figure: a sample of the cleaned data stored in MySQL]
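The duration conversion in the cleaning step can be tried out on a few sample strings. The block below is a minimal sketch, not the article's code: the helper name `parse_duration_minutes` and the sample inputs are hypothetical, and it assumes durations look like `1小时35分钟`, `2小时`, or `95分钟`. It uses a single regular expression with optional hour and minute parts, so hour-only and minute-only values are both handled and anything else yields `None`.

```python
import re
from typing import Optional

# Hypothetical helper for illustration: optional "<N>小时" followed by optional "<M>分钟".
_DURATION_RE = re.compile(r'^\s*(?:(\d+)\s*小时)?\s*(?:(\d+)\s*分钟)?\s*$')


def parse_duration_minutes(text) -> Optional[int]:
    """Return the duration in minutes, or None if the text is not a duration."""
    if not isinstance(text, str):
        return None  # NaN, None, or any other non-string value
    match = _DURATION_RE.match(text)
    # Reject text that matched only because both parts are optional (e.g. '')
    if match is None or (match.group(1) is None and match.group(2) is None):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


if __name__ == '__main__':
    # Sample values are made up for the demonstration
    for sample in ['1小时35分钟', '2小时', '95分钟', '剧情']:
        print(sample, '->', parse_duration_minutes(sample))
```

With a helper of this shape, a column could be converted with `df['movie_duration'].apply(parse_duration_minutes)`, which leaves unparseable values as missing so they can be filled or dropped explicitly afterwards.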

