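Automatically Scraping and Organizing NetEase Cloud Music with Python
It started simply: I had pretty much worn out my jellyfin music library and wanted to grab some new songs. I had already scraped over 1,000 tracks, but many of them were there only because my completionist streak made me download whole albums for the sake of a complete library. What I actually listen to regularly is two or three hundred songs.
So I wanted to restock my library, started tinkering, and this post is the result.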
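Approach
Last time I never automated the downloading and, being impatient, collected everything half by script and half by hand. It was tedious and took about a week. This time I decided to automate the whole thing in Python. The workflow looks like this:
- Whenever I come across a song I like, I save its NetEase Cloud Music URL to a text file, e.g. https://music.163.com/#/song?id=1499823984 (feel free to open it and see which track it is), one song per line, collecting as I listen.
- Read this song list file, parse it line by line, and get the album ID, e.g. from https://music.163.com/#/album?id=99010037 take 99010037, along with the artist, KOTOKO.
- Then parse the album page and get the song?id of every song on it, e.g. 1499823984, 1499826814, 1499823975 and 1499826811.
- Call a third-party API to resolve each song and fetch the audio file, lyrics and cover. This time I used https://jx.chksz.top/
- Save the song files in the layout jellyfin expects, i.e. Artist/Album/01.Song.flac, for example:
[chocola@Neko-Server KOTOKO]$ tree
.
└── ネコぱら vol.4 OP ⁄ ED
    ├── 01.SWEET×SWEET.flac
    ├── 01.SWEET×SWEET.lrc
    ├── 02.NEGAIGOTO.flac
    ├── 02.NEGAIGOTO.lrc
    ├── 03.SWEET×SWEET (Instrumental).flac
    ├── 03.SWEET×SWEET (Instrumental).lrc
    ├── 04.NEGAIGOTO (Instrumental).flac
    ├── 04.NEGAIGOTO (Instrumental).lrc
    └── cover.jpg
- Finally, tag each file with the artist, album and song title.
With the approach and method settled, let's tackle the pieces one by one!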
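The first steps of this workflow depend on pulling numeric IDs out of URLs whose query string sits after the #. Here is a minimal sketch of that step. The helper name extract_netease_id is hypothetical, and it assumes the URLs look like the two examples above.

```python
import re
from typing import Optional

# Matches both "https://music.163.com/#/song?id=1" and "https://music.163.com/song?id=1".
_ID_RE = re.compile(r'/(song|album)\?id=(\d+)')


def extract_netease_id(url: str, kind: str = "song") -> Optional[str]:
    """Return the numeric id from a song or album URL, or None if it does not match."""
    m = _ID_RE.search(url.strip())
    if m and m.group(1) == kind:
        return m.group(2)
    return None


if __name__ == "__main__":
    # A line read from the list file still carries its trailing newline.
    print(extract_netease_id("https://music.163.com/#/song?id=1499823984\n"))
    print(extract_netease_id("https://music.163.com/#/album?id=99010037", kind="album"))
```

A regex is used here rather than urllib.parse because the id lives in the URL fragment, which urlparse does not split into query parameters.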
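Inspecting the Page Layout
Open the song page, hit F12, inspect the elements, and the target turns up quickly:
As I found out later, the page's meta tags carry the same information:
For the album page, scraping the track list is enough:
It doesn't look hard: fetch the HTML with Python's requests module, then parse the content.
A First Attempt at the Scraper Logic
After a friendly chat with an LLM, I ended up with a function for each part of the logic, for example this one for parsing the song page: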
import re
import requests

def extract_music_info(url):
    # Mimic a browser's request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        # Fetch the HTML content
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        html = response.text
        print(html)
        # Extract the artist (og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None
        # Extract the album name (og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None
        # Extract the album ID (parsed from the URL in music:album)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None
        # Only return a result if every field was extracted
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        else:
            return None
    except Exception as e:
        print(f"Extraction failed: {str(e)}")
        return None
But as luck would have it, the day after I wrote the code, just as I was about to start scraping, it returned nothing. I was puzzled: the code ran fine yesterday, so why was it failing today? Re-checking the page, I found that overnight NetEase Cloud Music had moved the page content into an iframe, which made all my code useless.
Talk about timing. Since loading that content requires emulating browser behavior, I decided to go with selenium. I don't like this bloated, inefficient style of scraping, but it looked like the only way to get the job done quickly. Hiding content in an iframe is a brutal anti-scraping move!
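selenium and Firefox vs. the iframe
Setting Up and Getting to Know selenium
First, install selenium:
pip install selenium
Then download selenium's Firefox driver from github.com/mozilla/geckodriver/releases, picking the build that matches your system. I downloaded geckodriver-v0.36.0-linux64.tar.gz and put it in the project folder (the code below expects the geckodriver binary at ./geckodriver).
Before using selenium you need to know the basics. Here is a simple example: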
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
# Create a Service object pointing at the geckodriver binary
service = Service(executable_path='./geckodriver')
# Create an Options object pointing at the Firefox binary
options = Options()
options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path
# Start the Firefox driver, passing in both service and options
driver = webdriver.Firefox(service=service, options=options)
driver.get('https://www.nekopara.uk/')
print(driver.page_source)
driver.quit()
This simply fetches the page's rendered HTML and prints it.
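Scraping
Below are the functions that do the scraping. If the browser isn't closed, the iframe content retrieved is still that of the first page opened, so the session can't be reused. The browser has to be closed after each scrape and opened again.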
# Needed in addition to the imports from the basic example above
import re
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def extract_music_info(url):
    # Create the Service object
    service = Service(executable_path='./geckodriver')
    # Create the Options object
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path
    # Start the Firefox driver
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # Open the target page
        driver.get(url)
        # Wait for the iframe to appear
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # Switch into the iframe
        driver.switch_to.frame(iframe)
        # Wait (up to 10 seconds) for the iframe's own content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # Grab the full HTML inside the iframe
        html = driver.page_source
        # Extract the artist (og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None
        # Extract the album name (og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None
        # Extract the album ID (parsed from the URL in music:album)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None
        # Only return a result if every field was extracted
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        else:
            return None
    except Exception as e:
        print(f"❌ Processing failed: {str(e)}")
        return None
    finally:
        driver.quit()

def get_album_info(url):
    # Create the Service object
    service = Service(executable_path='./geckodriver')
    # Create the Options object
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path
    # Start the Firefox driver
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # Open the target page
        driver.get(url)
        # Wait for the iframe to appear
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # Switch into the iframe
        driver.switch_to.frame(iframe)
        # Wait (up to 10 seconds) for the iframe's own content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # Grab the full HTML inside the iframe
        iframe_html = driver.page_source
        soup = BeautifulSoup(iframe_html, 'html.parser')
        song_dict = {}
        # Walk every song row (tr element)
        for tr in soup.find_all('tr'):
            # Track number on the page (inside <span class="num">)
            num_span = tr.find('span', class_='num')
            if not num_span:
                continue
            try:
                page_num = int(num_span.get_text(strip=True))
            except ValueError:
                continue
            # Song ID (inside <a href="/song?id=XXX">)
            a_tag = tr.find('a', href=lambda href: href and href.startswith('/song?id='))
            if not a_tag:
                continue
            href = a_tag['href']
            song_id = href.split('=')[1]
            song_dict[page_num] = song_id
        return song_dict
    except Exception as e:
        print(f"❌ Processing failed: {str(e)}")
        # Return an empty dict so callers can still iterate over the result
        return {}
    finally:
        driver.quit()
Only after driver.switch_to.frame(iframe) is the iframe's content accessible. Without it, the result is the same as a plain request made with Python's requests module.
That gives us the artist name, album name and song IDs, so we can move on to the next step.
Browser automation really does feel like the scraper's last resort: it can fetch almost anything, but it is certainly less efficient than sending requests directly.
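The row-parsing step in get_album_info can be tried without launching a browser. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages. The sample HTML is made up to mirror the structure the function above looks for (a span with class num and a link starting with /song?id=); it is not a capture of the real page.

```python
from html.parser import HTMLParser


class TrackTableParser(HTMLParser):
    """Collect {track_number: song_id} from rows shaped like the album track table."""

    def __init__(self):
        super().__init__()
        self.tracks = {}
        self._in_num = False    # currently inside <span class="num">
        self._num = None        # track number seen in the current row
        self._song_id = None    # song id seen in the current row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # A new row starts: forget whatever the previous row held.
            self._num, self._song_id = None, None
        elif tag == "span" and "num" in (attrs.get("class") or "").split():
            self._in_num = True
        elif tag == "a" and self._song_id is None:
            href = attrs.get("href") or ""
            if href.startswith("/song?id="):
                self._song_id = href.split("=", 1)[1]

    def handle_data(self, data):
        if self._in_num and data.strip().isdigit():
            self._num = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_num = False
        elif tag == "tr" and self._num is not None and self._song_id:
            # Only rows that have both a number and a song link are kept.
            self.tracks[self._num] = self._song_id


# Hypothetical sample markup, for illustration only.
SAMPLE = """
<table><tbody>
<tr><td><span class="num">1</span></td><td><a href="/song?id=1499823984"><b>A</b></a></td></tr>
<tr><td><span class="num">2</span></td><td><a href="/song?id=1499826814"><b>B</b></a></td></tr>
<tr><td>no number here</td></tr>
</tbody></table>
"""

parser = TrackTableParser()
parser.feed(SAMPLE)
print(parser.tracks)
```

Keeping the parsing separate from the browser session like this makes it easier to re-check the selectors when the page layout changes.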
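Fetching Data from a Third-Party Resolver API
Here I used the resolver service at https://jx.chksz.top/, which I found more or less at random through a search engine. From a quick look, it seems to be run by a middle school student. Impressive for someone so young!
After some time in the F12 dev tools, I captured the API address: https://api.kxzjoker.cn/api/163_music

Next comes the function that builds the request and downloads everything: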
def download_song_and_metadata(save_path: str, song_url: str, number_info: str, level: str = "jymaster") -> bool:
    # === Step 1: query the resolver API ===
    api_endpoint = "https://api.kxzjoker.cn/api/163_music"
    form_data = {
        "url": song_url,
        "level": level,
        "type": "json"
    }
    try:
        print("Requesting song info...")
        resp = requests.post(api_endpoint, data=form_data, timeout=15)
        resp.raise_for_status()
        data = resp.json()
        if data.get("status") != 200:
            print(f"API returned an error status: {data.get('status')}")
            return False
        # Extract fields (sanitize_filename is defined in the full listing)
        title = sanitize_filename(data["name"])
        artist = data["ar_name"]
        album = sanitize_filename(data["al_name"])
        audio_url = data["url"]
        cover_url = data["pic"]
        # "or" also covers the case where the key is present but null
        lyric = data.get("lyric") or ""
        tlyric = data.get("tlyric") or ""
        # === Step 2: create the album directory ===
        album_dir = os.path.join(save_path, album)
        os.makedirs(album_dir, exist_ok=True)
        # === Step 3: download the audio file ===
        audio_ext = ".flac" if ".flac" in audio_url else ".mp3"
        audio_path = os.path.join(album_dir, f"{number_info}.{title}{audio_ext}")
        if not os.path.exists(audio_path):
            print(f"Downloading audio: {title}{audio_ext}")
            audio_resp = requests.get(audio_url, stream=True, timeout=30)
            audio_resp.raise_for_status()
            with open(audio_path, "wb") as f:
                for chunk in audio_resp.iter_content(chunk_size=8192):
                    f.write(chunk)
        else:
            print(f"Audio already exists, skipping: {audio_path}")
        # === Step 4: merge the lyrics and save the .lrc ===
        lrc_path = os.path.join(album_dir, f"{number_info}.{title}.lrc")
        if tlyric:
            combined_lyric = merge_lyrics(lyric, tlyric)
        else:
            combined_lyric = lyric
        with open(lrc_path, "w", encoding="utf-8") as f:
            f.write(combined_lyric)
        print(f"Lyrics saved: {lrc_path}")
        # === Step 5: download the cover as folder.jpg ===
        cover_path = os.path.join(album_dir, "folder.jpg")
        if not os.path.exists(cover_path):
            print("Downloading album cover...")
            cover_resp = requests.get(cover_url, timeout=15)
            cover_resp.raise_for_status()
            with open(cover_path, "wb") as f:
                f.write(cover_resp.content)
            print(f"Cover saved: {cover_path}")
        else:
            print("Cover already exists, skipping")
        print(f"\n✅ Download complete! Path: {album_dir}")
        return True
    except Exception as e:
        print(f"❌ Download failed: {e}")
        return False

def merge_lyrics(original: str, translation: str) -> str:
    """
    Merge the original and translated lyrics into a bilingual LRC.
    Assumes the timestamps match; the translation goes on the following line.
    """
    orig_lines = original.strip().splitlines()
    trans_dict = {}
    # Parse the translated lyrics into time tag -> text
    for line in translation.strip().splitlines():
        if "]:" in line or line.startswith("[by:"):
            continue
        parts = line.split("]", 1)
        if len(parts) == 2:
            time_tag, text = parts[0] + "]", parts[1].strip()
            if text:
                trans_dict[time_tag] = text
    # Merge
    merged = []
    for line in orig_lines:
        merged.append(line)
        if line.startswith("[") and "]" in line:
            time_tag = line.split("]", 1)[0] + "]"
            if time_tag in trans_dict:
                merged.append(time_tag + trans_dict[time_tag])
    return "\n".join(merged)
Putting It All Together
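With every component in hand, what remains is stitching them together: pass in a text file with one song URL per line, and the program downloads all the related files fully automatically and stores them according to the layout convention.
Here is the final stitched code: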
import requests
import re
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
from bs4 import BeautifulSoup

def get_album_info(url):
    # Create the Service object
    service = Service(executable_path='./geckodriver')
    # Create the Options object
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path
    # Start the Firefox driver
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # Open the target page
        driver.get(url)
        # Wait for the iframe to appear
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # Switch into the iframe
        driver.switch_to.frame(iframe)
        # Wait (up to 10 seconds) for the iframe's own content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # Grab the full HTML inside the iframe
        iframe_html = driver.page_source
        soup = BeautifulSoup(iframe_html, 'html.parser')
        song_dict = {}
        # Walk every song row (tr element)
        for tr in soup.find_all('tr'):
            # Track number on the page (inside <span class="num">)
            num_span = tr.find('span', class_='num')
            if not num_span:
                continue
            try:
                page_num = int(num_span.get_text(strip=True))
            except ValueError:
                continue
            # Song ID (inside <a href="/song?id=XXX">)
            a_tag = tr.find('a', href=lambda href: href and href.startswith('/song?id='))
            if not a_tag:
                continue
            href = a_tag['href']
            song_id = href.split('=')[1]
            song_dict[page_num] = song_id
        return song_dict
    except Exception as e:
        print(f"❌ Processing failed: {str(e)}")
        # Return an empty dict so the caller's loop simply does nothing
        return {}
    finally:
        driver.quit()

def sanitize_filename(filename: str) -> str:
    """Replace characters that are illegal in filenames"""
    return re.sub(r'[<>:"/\\|?*\x00-\x1f]', '_', filename).strip()

def download_song_and_metadata(save_path: str, song_url: str, number_info: str, level: str = "jymaster") -> bool:
    # === Step 1: query the resolver API ===
    api_endpoint = "https://api.kxzjoker.cn/api/163_music"
    form_data = {
        "url": song_url,
        "level": level,
        "type": "json"
    }
    try:
        print("Requesting song info...")
        resp = requests.post(api_endpoint, data=form_data, timeout=15)
        resp.raise_for_status()
        data = resp.json()
        if data.get("status") != 200:
            print(f"API returned an error status: {data.get('status')}")
            return False
        # Extract fields
        title = sanitize_filename(data["name"])
        artist = data["ar_name"]
        album = sanitize_filename(data["al_name"])
        audio_url = data["url"]
        cover_url = data["pic"]
        # "or" also covers the case where the key is present but null
        lyric = data.get("lyric") or ""
        tlyric = data.get("tlyric") or ""
        # === Step 2: create the album directory ===
        album_dir = os.path.join(save_path, album)
        os.makedirs(album_dir, exist_ok=True)
        # === Step 3: download the audio file ===
        audio_ext = ".flac" if ".flac" in audio_url else ".mp3"
        audio_path = os.path.join(album_dir, f"{number_info}.{title}{audio_ext}")
        if not os.path.exists(audio_path):
            print(f"Downloading audio: {title}{audio_ext}")
            audio_resp = requests.get(audio_url, stream=True, timeout=30)
            audio_resp.raise_for_status()
            with open(audio_path, "wb") as f:
                for chunk in audio_resp.iter_content(chunk_size=8192):
                    f.write(chunk)
        else:
            print(f"Audio already exists, skipping: {audio_path}")
        # === Step 4: merge the lyrics and save the .lrc ===
        lrc_path = os.path.join(album_dir, f"{number_info}.{title}.lrc")
        if tlyric:
            combined_lyric = merge_lyrics(lyric, tlyric)
        else:
            combined_lyric = lyric
        with open(lrc_path, "w", encoding="utf-8") as f:
            f.write(combined_lyric)
        print(f"Lyrics saved: {lrc_path}")
        # === Step 5: download the cover as folder.jpg ===
        cover_path = os.path.join(album_dir, "folder.jpg")
        if not os.path.exists(cover_path):
            print("Downloading album cover...")
            cover_resp = requests.get(cover_url, timeout=15)
            cover_resp.raise_for_status()
            with open(cover_path, "wb") as f:
                f.write(cover_resp.content)
            print(f"Cover saved: {cover_path}")
        else:
            print("Cover already exists, skipping")
        print(f"\n✅ Download complete! Path: {album_dir}")
        return True
    except Exception as e:
        print(f"❌ Download failed: {e}")
        return False

def merge_lyrics(original: str, translation: str) -> str:
    """
    Merge the original and translated lyrics into a bilingual LRC.
    Assumes the timestamps match; the translation goes on the following line.
    """
    orig_lines = original.strip().splitlines()
    trans_dict = {}
    # Parse the translated lyrics into time tag -> text
    for line in translation.strip().splitlines():
        if "]:" in line or line.startswith("[by:"):
            continue
        parts = line.split("]", 1)
        if len(parts) == 2:
            time_tag, text = parts[0] + "]", parts[1].strip()
            if text:
                trans_dict[time_tag] = text
    # Merge
    merged = []
    for line in orig_lines:
        merged.append(line)
        if line.startswith("[") and "]" in line:
            time_tag = line.split("]", 1)[0] + "]"
            if time_tag in trans_dict:
                merged.append(time_tag + trans_dict[time_tag])
    return "\n".join(merged)

def extract_music_info(url):
    # Create the Service object
    service = Service(executable_path='./geckodriver')
    # Create the Options object
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path
    # Start the Firefox driver
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # Open the target page
        driver.get(url)
        # Wait for the iframe to appear
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # Switch into the iframe
        driver.switch_to.frame(iframe)
        # Wait (up to 10 seconds) for the iframe's own content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # Grab the full HTML inside the iframe
        html = driver.page_source
        # Extract the artist (og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None
        # Extract the album name (og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None
        # Extract the album ID (parsed from the URL in music:album)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None
        # Only return a result if every field was extracted
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        else:
            return None
    except Exception as e:
        print(f"❌ Processing failed: {str(e)}")
        return None
    finally:
        driver.quit()

def download_album_from_song(url):
    info_result = extract_music_info(url)
    if info_result:
        print(f"Artist: {info_result[0]}")
        print(f"Album name: {info_result[1]}")
        print(f"Album ID: {info_result[2]}")
    else:
        print("Extraction failed, check the URL or the page structure")
        # Nothing to download without the album info
        return
    # Sanitize the artist name too, since it becomes a directory name
    download_path = os.path.join(os.getcwd(), sanitize_filename(info_result[0]))
    songid_s = get_album_info("https://music.163.com/#/album?id=" + info_result[2])
    # "or {}" guards against get_album_info returning None on failure
    for number_info, songid_in in (songid_s or {}).items():
        to_download_url = "https://music.163.com/song?id=" + songid_in
        download_song_and_metadata(download_path, to_download_url, str(number_info), level="jymaster")

def extract_songlist(filename: str):
    """
    Main function: read the given file and process it line by line.
    :param filename: name of the txt file to read (located in the current directory)
    """
    # Get the current working directory
    current_dir = os.getcwd()
    file_path = os.path.join(current_dir, filename)
    # Check that the file exists
    if not os.path.isfile(file_path):
        print(f"Error: file '{filename}' does not exist in the current directory ({current_dir})")
        return
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line_number, line in enumerate(file, start=1):
                # Strip the trailing newline and skip blank lines
                line = line.strip()
                if line == '':
                    continue
                try:
                    download_album_from_song(line)
                except Exception as e:
                    print(f"Error while processing line {line_number}: {e}")
    except Exception as e:
        print(f"Error while reading the file: {e}")

# Usage example
if __name__ == "__main__":
    txt_filename = "collect_list.txt"
    extract_songlist(txt_filename)
Once the required Python modules are installed, just save collect_list.txt in the current directory, put the browser driver geckodriver there, make sure all the paths are correct, and launch the script with Python. After a while, the albums you wanted will appear in the current directory.
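Tagging the Music Files
The downloaded flac files lack metadata, so it has to be filled in from the path and filename. At this point I also realized that the batch of library files I scraped earlier had never been tagged properly, which is why jellyfin's display was still somewhat messy.
The following Python program fixes that: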
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Write the track number and title from each filename into the audio file's metadata (supports FLAC/MP3/M4A)
Usage examples:
  python3 write_title_and_track.py /home/chocola/data/files/Music --dry-run
  python3 write_title_and_track.py /home/chocola/data/files/Music --force
"""
import os, sys, re, argparse
from pathlib import Path
# try import mutagen, auto-install if missing (user mode)
try:
    from mutagen import File
    from mutagen.flac import FLAC
    from mutagen.id3 import ID3, TIT2, TRCK, ID3NoHeaderError
    from mutagen.mp4 import MP4, MP4Tags
except Exception:
    print("mutagen not found, trying to install it (user mode: pip3 install --user mutagen)...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "mutagen"])
    from mutagen import File
    from mutagen.flac import FLAC
    from mutagen.id3 import ID3, TIT2, TRCK, ID3NoHeaderError
    from mutagen.mp4 import MP4, MP4Tags

AUDIO_EXTS = ('.flac', '.mp3', '.m4a', '.mp4', '.aac', '.ogg', '.opus')  # will skip unknown handlers
# regex to extract track number and title
# matches: "01. Title", "1 - Title", "01 Title", "01_ Title", "01.Title", leading zeros allowed
_leading_track_re = re.compile(r'^\s*(\d{1,3})\s*[\.\-_\s:]+\s*(.+)$')
# fallback: "01Title" (no separator)
_leading_track_nosep_re = re.compile(r'^\s*(\d{1,3})(.+)$')

# normalize fullwidth parentheses and brackets to ascii
def normalize_parens(s: str) -> str:
    return (s
            .replace('(', '(').replace(')', ')')
            .replace('[', '[').replace(']', ']')
            .replace('【', '[').replace('】', ']')
            )

def parse_filename_to_track_title(fn: str):
    name = fn
    # the extension is normally removed by the caller already
    name = name.strip()
    name = normalize_parens(name)
    # try main regex
    m = _leading_track_re.match(name)
    if m:
        tr = m.group(1).lstrip('0') or '0'
        title = m.group(2).strip()
        return tr, title
    m2 = _leading_track_nosep_re.match(name)
    if m2:
        tr = m2.group(1).lstrip('0') or '0'
        title = m2.group(2).strip()
        # note: no extra check is made here for very short or numeric titles
        return tr, title
    # No leading track detected: return (None, full name)
    return None, name

def write_flac(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    audio = FLAC(path)
    tags = audio.tags or {}
    changed = False
    # ensure artist/album set if available
    if artist and (not tags.get('artist') or tags.get('artist') != [artist]):
        if not dry_run:
            audio['artist'] = [artist]
        changed = True
    if album and (not tags.get('album') or tags.get('album') != [album]):
        if not dry_run:
            audio['album'] = [album]
        changed = True
    # title
    cur_title = tags.get('title', None)
    if (force or not cur_title) and title:
        if not dry_run:
            audio['title'] = [title]
        changed = True
    # tracknumber
    cur_track = tags.get('tracknumber', None)
    if tracknum and (force or not cur_track):
        if not dry_run:
            audio['tracknumber'] = [str(tracknum)]
        changed = True
    if changed and not dry_run:
        audio.save()
    return changed

def write_mp3(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    # TPE1/TALB are imported here because the module header only pulls in TIT2/TRCK
    from mutagen.id3 import TPE1, TALB
    try:
        audio = ID3(path)
    except ID3NoHeaderError:
        audio = ID3()
    changed = False
    # title (TIT2)
    cur_title = audio.getall('TIT2')
    if (force or not cur_title) and title:
        if not dry_run:
            audio.delall('TIT2')
            audio.add(TIT2(encoding=3, text=title))
        changed = True
    # track number (TRCK)
    cur_tr = audio.getall('TRCK')
    if tracknum and (force or not cur_tr):
        if not dry_run:
            audio.delall('TRCK')
            audio.add(TRCK(encoding=3, text=str(tracknum)))
        changed = True
    # artist (TPE1)
    if artist:
        cur_a = audio.getall('TPE1')
        if force or not cur_a:
            if not dry_run:
                audio.delall('TPE1')
                audio.add(TPE1(encoding=3, text=artist))
            changed = True
    # album (TALB)
    if album:
        cur_al = audio.getall('TALB')
        if force or not cur_al:
            if not dry_run:
                audio.delall('TALB')
                audio.add(TALB(encoding=3, text=album))
            changed = True
    if changed and not dry_run:
        audio.save(path)
    return changed

def write_mp4(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    audio = MP4(path)
    tags = audio.tags or MP4Tags()
    changed = False
    # artist
    if artist and (tags.get('\xa9ART') != [artist]):
        if not dry_run:
            tags['\xa9ART'] = [artist]
        changed = True
    # album
    if album and (tags.get('\xa9alb') != [album]):
        if not dry_run:
            tags['\xa9alb'] = [album]
        changed = True
    # title
    if (force or not tags.get('\xa9nam')) and title:
        if not dry_run:
            tags['\xa9nam'] = [title]
        changed = True
    # track number (trkn) is a tuple (track, total)
    if tracknum and (force or not tags.get('trkn')):
        try:
            tr = int(tracknum)
        except Exception:
            tr = None
        if tr:
            if not dry_run:
                tags['trkn'] = [(tr, 0)]
            changed = True
    if changed and not dry_run:
        audio.tags = tags
        audio.save()
    return changed

def process_file(path: Path, root: Path, dry_run=False, force=False):
    # derive artist/album from path: root/.../Artist/Album/file
    rel = path.relative_to(root)
    parts = rel.parts
    if len(parts) < 3:
        # not deep enough; skip
        return False, "path_too_shallow"
    album = parts[-2]
    artist = parts[-3]
    stem = path.stem  # name without extension
    # Normalize Unicode around parentheses
    stem = normalize_parens(stem)
    tracknum, title = parse_filename_to_track_title(stem)
    # if no tracknum found, leave tracknum None and use the entire stem as the title
    if tracknum is None:
        # keep any trailing bracketed translation, e.g. "SongName (translation)", as part of the title
        title = stem.strip()
    # write depending on extension
    ext = path.suffix.lower()
    try:
        if ext == '.flac':
            changed = write_flac(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        elif ext == '.mp3':
            changed = write_mp3(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        elif ext in ('.m4a', '.mp4'):
            changed = write_mp4(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        else:
            # unsupported ext -> try generic mutagen write as fallback
            audio = File(path)
            if audio is None:
                return False, "unsupported_format"
            # best-effort: set common tags
            changed = False
            if getattr(audio, 'tags', None) is not None:
                if force or not audio.tags.get('title'):
                    if not dry_run:
                        audio.tags['title'] = title
                    changed = True
                if force or not audio.tags.get('artist'):
                    if not dry_run:
                        audio.tags['artist'] = artist
                    changed = True
                if force or not audio.tags.get('album'):
                    if not dry_run:
                        audio.tags['album'] = album
                    changed = True
                if tracknum and (force or not audio.tags.get('tracknumber')):
                    if not dry_run:
                        audio.tags['tracknumber'] = str(tracknum)
                    changed = True
                if changed and not dry_run:
                    audio.save()
            return changed, "done"
        return changed, "done"
    except Exception as e:
        return False, f"error:{e}"

def main():
    p = argparse.ArgumentParser(description="Write title and track from filenames into audio metadata (supports FLAC/MP3/M4A)")
    p.add_argument("root", help="music library root directory (walked recursively)")
    p.add_argument("--dry-run", action="store_true", help="preview only, write nothing")
    p.add_argument("--force", action="store_true", help="overwrite existing tags")
    p.add_argument("--limit", type=int, default=0, help="only process the first N files (for testing)")
    args = p.parse_args()
    root = Path(args.root).resolve()
    if not root.is_dir():
        print("Error: root is not a directory:", root)
        sys.exit(1)
    total = 0
    changed_cnt = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for fn in filenames:
            if not fn.lower().endswith(AUDIO_EXTS):
                continue
            total += 1
            path = Path(dirpath) / fn
            changed, reason = process_file(path, root, dry_run=args.dry_run, force=args.force)
            if changed:
                changed_cnt += 1
                print(("DRY " if args.dry_run else "") + f"Modified: {path} ({reason})")
            # optional limit
            if args.limit and total >= args.limit:
                break
        if args.limit and total >= args.limit:
            break
    print(f"Done: total files={total}, modified={changed_cnt}")

if __name__ == "__main__":
    main()
Just pass it the music library path and run it, for example:
python tag_music.py /run/media/chocola/6T-DATA/Projects/Get_Information/Fuck_Netease/Music/
Note that the directory layout must match what was described earlier, i.e. Artist/Album/01.SongTitle.flac
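To see why the layout matters, here is a small sketch of how tags fall out of a path. It reuses the leading-track regex from the script above; the helper name tags_from_path is hypothetical, and the example path is assembled from the tree listing shown earlier.

```python
import re
from pathlib import Path

# Same leading-track pattern as the tagging script: digits, a separator, then the title.
LEADING_TRACK = re.compile(r'^\s*(\d{1,3})\s*[\.\-_\s:]+\s*(.+)$')


def tags_from_path(path: Path) -> dict:
    """Derive artist/album from the two parent folders and track/title from the file stem."""
    m = LEADING_TRACK.match(path.stem)
    if m:
        track, title = m.group(1).lstrip('0') or '0', m.group(2).strip()
    else:
        # No leading number: keep the whole stem as the title.
        track, title = None, path.stem
    return {
        "artist": path.parent.parent.name,
        "album": path.parent.name,
        "tracknumber": track,
        "title": title,
    }


example = Path("Music/KOTOKO/ネコぱら vol.4 OP ⁄ ED/03.SWEET×SWEET (Instrumental).flac")
print(tags_from_path(example))
```

If a file sits directly under the artist folder, or one level too deep, the artist and album values shift by one folder, which is why the script skips paths that are too shallow.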
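Once it finishes, you'll have a media library of your own!
But is that really the end of the story?
Exposing NetEase Cloud Music's Fake Master Scam
If you were paying attention, you'll have noticed that the quality parameter I used when fetching songs was the highest one, jymaster, the "Ultra-Clear Master" tier.
So what is a master, by definition?
A master is the final finished audio file in the music production process, used for subsequent duplication, release and streaming distribution. After mixing is complete, the mastering stage makes fine adjustments to the audio as a whole, including equalization, compression, limiting and dynamics control, so that it sounds consistent across playback devices and meets industry standards for loudness and format. The master is not only the source file for physical or digital release; it also represents the final sound of the work and is the key link between the creators and the listener. A high-quality master improves the overall clarity, sense of space and polish of the music.
That sounds great, but in practice most released music files are exported from CDs, which means the quality ceiling we normally have access to is CD lossless, 16-bit/44.1kHz. The so-called masters on the NetEase Cloud Music platform are overwhelmingly upsampled from CD quality after the fact. Keeping these huge files brings little audible improvement while taking up more storage. I only learned this from chatting with headphone enthusiasts in a group chat.
If you don't believe it, install sox and look at the audio's spectrogram.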
sox 1.SWEET×SWEET.flac -n spectrogram -o ./infox.png
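You can clearly see that the energy is concentrated in the 0–20kHz region, shown in yellow-white/orange/red.
The 20–90kHz region, by contrast, is deep purple/blue, with almost no signal.
If you still have doubts, here is a higher-resolution spectrogram: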
sox "1.SWEET×SWEET.flac" -n spectrogram -o "high_res_spectrogram.png" -x 4000 -y 1200
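This is a clear sign of after-the-fact upsampling, whether done by hand or by algorithmic/AI interpolation, so there is no need to keep this useless data, and the files can safely be brought back down to CD lossless quality.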
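Some quick arithmetic shows what is at stake. The sketch below compares raw PCM data rates and the highest frequency each sample rate can represent. The 24-bit/192 kHz figure is an assumption on my part: the article does not state the format of the downloaded files, and the spectrogram reaching about 90 kHz merely suggests a sample rate in that range. FLAC compresses the raw stream, so real file sizes will differ from these ratios.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 2) -> float:
    """Raw (uncompressed) PCM bitrate in kbit/s."""
    return sample_rate_hz * bit_depth * channels / 1000


def nyquist_khz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent, in kHz."""
    return sample_rate_hz / 2 / 1000


cd = pcm_bitrate_kbps(44_100, 16)       # CD quality, as named in the article
hires = pcm_bitrate_kbps(192_000, 24)   # assumed hi-res format, see the note above

print(f"CD:     {cd:.1f} kbit/s, content up to {nyquist_khz(44_100):.2f} kHz")
print(f"Hi-res: {hires:.1f} kbit/s, content up to {nyquist_khz(192_000):.2f} kHz")
print(f"Raw size ratio CD/hi-res: {cd / hires:.3f}")
```

Since CD quality already covers frequencies up to 22.05 kHz, which includes the band where the spectrogram shows actual energy, downsampling discards only the nearly empty region above it.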
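Converting the Files with a Script to Save Space
These media files really do take up a lot of room, 160G of storage on my NAS, and an AI told me at least 60%-70% of that could be saved, so I decided to clean up the library.
Save the code below as convert.py, and remember to change the source library path and the new library path.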
#!/usr/bin/env python3
import os
import shutil
import subprocess
import sys
from pathlib import Path
from mutagen import File as MutagenFile
# ================== Configuration (edit these paths) ==================
SOURCE = Path("/home/chocola/data/files/Music/")  # source library path
TARGET = Path("/mnt/4T-RAID1/data/music_01/")     # new library path
# ======================================================================
# Path normalizes trailing slashes itself, so no manual slash handling is needed
# Validate the paths
if not SOURCE.exists():
    raise FileNotFoundError(f"Source path does not exist: {SOURCE}")
TARGET.mkdir(parents=True, exist_ok=True)

def get_sample_rate(file_path):
    """Get a FLAC file's sample rate (mutagen first, mediainfo as a fallback)"""
    # Try mutagen first
    try:
        audio = MutagenFile(file_path)
        # Compare against None: a mutagen file object without tags may be falsy
        if audio is not None and audio.info is not None:
            return audio.info.sample_rate
    except Exception:
        pass  # mutagen failed, fall back to mediainfo
    # Fall back to mediainfo
    try:
        result = subprocess.run(
            ['mediainfo', '--Inform=Audio;%SamplingRate%', str(file_path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True
        )
        rate_str = result.stdout.decode().strip()
        # Handle both formats: "48000" or "48.0 kHz"
        if 'kHz' in rate_str:
            return int(float(rate_str.replace('kHz', '').strip()) * 1000)
        return int(rate_str)
    except Exception as e:
        print(f"⚠ Could not get the sample rate (file: {file_path}, error: {e})")
        return 0  # 0 means an invalid sample rate

def is_high_resolution_flac(file_path):
    """Check whether a FLAC file has a high sample rate (>44.1kHz)"""
    sample_rate = get_sample_rate(file_path)
    if sample_rate <= 0:
        print(f"⚠ Skipping file (invalid sample rate): {file_path}")
        return False
    print(f"  [sample rate] {file_path}: {sample_rate} Hz")  # debug log
    return sample_rate > 44100  # >44.1kHz

def convert_flac(file_path):
    """Transcode a FLAC file with ffmpeg, running it non-interactively"""
    rel_path = file_path.relative_to(SOURCE)
    target_path = TARGET / rel_path
    # Make sure the target directory exists
    target_path.parent.mkdir(parents=True, exist_ok=True)
    # Read the metadata with mutagen
    audio = MutagenFile(file_path)
    if audio is None:
        audio = {}
    title = audio.get('title', [os.path.splitext(file_path.name)[0]])[0]
    artist = audio.get('artist', [''])[0]
    album = audio.get('album', [''])[0]
    # Transcode with ffmpeg to 16-bit/44.1kHz FLAC
    cmd = [
        'ffmpeg',
        '-y',
        '-i', str(file_path),
        '-map', '0:a:0',
        '-c:a', 'flac',
        '-sample_fmt', 's16',
        '-ar', '44100',
        '-metadata', f'title={title}',
        '-metadata', f'artist={artist}',
        '-metadata', f'album={album}',
        str(target_path)
    ]
    try:
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        )
        print(f"✅ Transcoded: {rel_path} (original sample rate: {get_sample_rate(file_path)}Hz)")
    except subprocess.CalledProcessError as e:
        print(f"❌ Transcoding failed (falling back to the original file): {file_path}")
        shutil.copy2(file_path, target_path)

def copy_file(file_path):
    """Copy a file that does not need transcoding"""
    rel_path = file_path.relative_to(SOURCE)
    target_path = TARGET / rel_path
    target_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(file_path, target_path)
    print(f"✅ Copied: {rel_path}")

def main():
    print(f"Processing media library: {SOURCE} → {TARGET}")
    print("⚠ Handling filenames with special characters (Japanese/Chinese)...")
    for root, _, files in os.walk(SOURCE):
        for file in files:
            file_path = Path(root) / file
            if file.startswith('.'):  # skip hidden files
                continue
            if file.lower().endswith('.flac'):
                if is_high_resolution_flac(file_path):
                    convert_flac(file_path)
                else:
                    copy_file(file_path)
            elif file.lower().endswith(('.mp3', '.lrc', '.jpg', '.png')):
                copy_file(file_path)
    print("\n🎉 All done! The new library is at:", TARGET)
    print("  - Fake hi-res FLAC (48kHz+/96kHz/192kHz) was transcoded to CD quality")
    print("  - CD-quality FLAC/MP3/auxiliary files were copied as-is")

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"❌ Fatal error: {e}")
        sys.exit(1)
After it finished, the media library went straight from 160G to 49G. Instant relief.
That said, today I tried logging in to NetEase Cloud Music to see what was going on. It said my account's phone number might have been reassigned by the carrier, and after I confirmed the account was mine, I found it had actually been unblocked. Previously it kept insisting on face-scan real-name verification, which I refused to do, so in a fit of anger I built my own media library, and before I knew it a year had passed. I won't be going back to NetEase Cloud Music, though. Maintaining my own library is enough, and it spares me the rage of finding yet another chunk of my playlist turned into VIP-only tracks I can't play.