Automatically Scraping and Organizing NetEase Cloud Music with Python

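It started simply enough: I had more or less worn out my Jellyfin music library and wanted some new songs. I had already scraped more than 1,000 tracks, but many of those were there only because my completionist streak made me grab entire albums for the sake of a tidy library; what I actually listen to is two or three hundred songs.
So I decided to restock the library, started tinkering, and ended up with this article.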

Implementation approach

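Last time I did not automate the downloads, and because I was impatient I collected everything half by hand. It was tedious and took about a week. This time I decided to let Python do all of it. The workflow is: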

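  • Whenever I come across a song I like, I save its NetEase URL to a text file, one song per line, for example https://music.163.com/#/song?id=1499823984 (click through if you are curious which song it is). The list grows as I listen.
  • Read the song list line by line and resolve each song to its album ID, for example 99010037 from https://music.163.com/#/album?id=99010037, along with the artist, KOTOKO.
  • Parse the album page to get the song?id of every track on it, for example 1499823984, 1499826814, 1499823975, and 1499826811.
  • Call a third-party API to resolve each song into an audio file, lyrics, and cover art. This time I used https://jx.chksz.top/.
  • Save the files in the layout Jellyfin expects, that is Artist/Album/01.Title.flac, for example: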

    [chocola@Neko-Server KOTOKO]$ tree
    .
    └── ネコぱら vol.4 OP ⁄ ED
      ├── 01.SWEET×SWEET.flac
      ├── 01.SWEET×SWEET.lrc
      ├── 02.NEGAIGOTO.flac
      ├── 02.NEGAIGOTO.lrc
      ├── 03.SWEET×SWEET (Instrumental).flac
      ├── 03.SWEET×SWEET (Instrumental).lrc
      ├── 04.NEGAIGOTO (Instrumental).flac
      ├── 04.NEGAIGOTO (Instrumental).lrc
      └── cover.jpg
  • Finally, write the artist, album, and title tags.

With the plan settled, let's knock the pieces out one by one!
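The first two steps boil down to pulling a numeric id out of a NetEase URL. Here is a minimal standard-library sketch of that step; the helper name netease_id is made up for illustration and is not part of the final script. It assumes the two URL shapes that appear in this article: the share form with "#/song?id=..." and the plain form with "/song?id=...".

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def netease_id(url: str) -> Optional[str]:
    """Return the numeric id from a NetEase song/album URL, or None.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url.strip())
    # In https://music.163.com/#/song?id=123 everything after '#' is the
    # fragment, so the query string has to be dug out of the fragment.
    query = parsed.query
    if not query and "?" in parsed.fragment:
        query = parsed.fragment.split("?", 1)[1]
    values = parse_qs(query).get("id", [])
    return values[0] if values and values[0].isdigit() else None


print(netease_id("https://music.163.com/#/song?id=1499823984"))
print(netease_id("https://music.163.com/#/album?id=99010037"))
```

Going through parse_qs rather than splitting on "=" keeps the helper working if extra query parameters are ever appended to the link.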

Inspecting the page layout

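Open a song page, hit F12, inspect the elements, and the target shows up right away:
01.png
Later I noticed that the page's meta tags carry the same information:
02.png
For the album page, grabbing the track list is enough:
03.png
It does not look hard: fetch the HTML with Python's requests module and parse it.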
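To make the "parse the meta tags" idea concrete, here is a small sketch that uses only the standard library's html.parser. The sample HTML is hand-written to resemble the tags described above, not a copy of a real NetEase response, and the names MetaCollector and collect_meta are invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key, value = attrs.get("property"), attrs.get("content")
        if key and value is not None:
            self.meta[key] = value


def collect_meta(html: str) -> dict:
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


# Hand-written sample, shaped like the tags seen in the browser inspector
SAMPLE_HTML = """
<head>
  <meta property="og:music:artist" content="KOTOKO" />
  <meta content="https://music.163.com/album?id=99010037" property="music:album" />
</head>
"""

meta = collect_meta(SAMPLE_HTML)
print(meta["og:music:artist"])
print(meta["music:album"])
```

Unlike a regular expression that expects property before content, a real HTML parser does not care about attribute order, which is why the second sample tag has the attributes swapped.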

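A first attempt at the scraper

After a friendly chat with an LLM I had a function for each part of the logic. For example, the one that parses the song page: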

import re
import requests

def extract_music_info(url):
    # Pretend to be a regular browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    try:
        # Fetch the HTML. Note: anything after '#' in a URL is a fragment and is
        # never sent to the server, so pass the plain form (.../song?id=...) here.
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        html = response.text

        # Artist (og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None

        # Album name (og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None

        # Album ID (parsed out of the URL in music:album)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None

        # All three fields are required
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        return None

    except Exception as e:
        print(f"Extraction failed: {e}")
        return None

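As luck would have it, the day after I wrote the code, just as I was about to start scraping, it returned nothing. I was puzzled: how could code that ran yesterday break today? Rechecking the page, I found that NetEase had moved the content into an iframe overnight, which made all my code useless.
Talk about timing. Loading that content means reproducing what the browser does, so I decided to fall back on selenium. I dislike this bloated, inefficient way of scraping, but it looked like the only quick way to get the job done.
The iframe anti-scraping trick is brutal!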

selenium and Firefox versus the iframe

Setting up and getting to know selenium

First, install selenium:

pip install selenium

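Next, download the Firefox driver for selenium from github.com/mozilla/geckodriver/releases, picking the build for your platform. I used geckodriver-v0.36.0-linux64.tar.gz; after downloading, extract it and put the geckodriver binary in the project folder.
Before using selenium it helps to know the basics. Here is a simple example: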

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

# Service object pointing at the geckodriver binary
service = Service(executable_path='./geckodriver')

# Options object pointing at the Firefox binary
options = Options()
options.binary_location = "/opt/firefox-esr/firefox"  # replace with your actual path

# Start Firefox with both the service and the options
driver = webdriver.Firefox(service=service, options=options)
try:
    driver.get('https://www.nekopara.uk/')
    print(driver.page_source)
finally:
    driver.quit()  # always close the browser, even if the page load fails

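In short, it fetches the rendered HTML of the page and prints it.

Scraping

Below are the scraping functions. In my tests, if the browser stayed open, the iframe I read back still held the content of the first page I had opened, so the session could not be reused; each function closes the browser when it finishes and the next call opens a new one.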

def extract_music_info(url):
    # 创建 Service 对象
    service = Service(executable_path='./geckodriver')
    # 创建 Options 对象
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # 替换为你的实际路径
    # 初始化 Firefox 驱动
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # 访问目标页面
        driver.get(url)
        # 等待iframe加载完成
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # 切换到iframe
        driver.switch_to.frame(iframe)
        # 等待iframe内部内容加载(等待10秒)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # 获取iframe中的完整HTML
        html = driver.page_source
        # 提取艺术家(og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None

        # 提取专辑名称(og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None

        # 提取专辑ID(从music:album的URL中解析)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None

        # 检查是否所有字段都提取成功
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        else:
            return None

    except Exception as e:
        print(f"❌ 处理失败: {str(e)}")
    finally:
        driver.quit()

def get_album_info(url):
    # 创建 Service 对象
    service = Service(executable_path='./geckodriver')
    # 创建 Options 对象
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # 替换为你的实际路径
    # 初始化 Firefox 驱动
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # 访问目标页面
        driver.get(url)
        # 等待iframe加载完成
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # 切换到iframe
        driver.switch_to.frame(iframe)
        # 等待iframe内部内容加载(等待10秒)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # 获取iframe中的完整HTML
        iframe_html = driver.page_source
        soup = BeautifulSoup(iframe_html, 'html.parser')
        song_dict = {}

        # 遍历所有歌曲行(tr元素)
        for tr in soup.find_all('tr'):
            # 提取页面序号(在<span class="num">中)
            num_span = tr.find('span', class_='num')
            if not num_span:
                continue

            try:
                page_num = int(num_span.get_text(strip=True))
            except ValueError:
                continue

            # 提取歌曲ID(在<a href="/song?id=XXX">中)
            a_tag = tr.find('a', href=lambda href: href and href.startswith('/song?id='))
            if not a_tag:
                continue

            href = a_tag['href']
            song_id = href.split('=')[1]
            song_dict[page_num] = song_id

        return song_dict
    except Exception as e:
        print(f"❌ 处理失败: {str(e)}")
    finally:
        driver.quit()

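Only after driver.switch_to.frame(iframe) is the iframe content available; without it, the result is the same as a plain request with Python's requests module.
That gives us the artist name, album name, and song IDs, so we can move on.
Browser automation really does feel like the last resort of scraping: it can get at almost anything, but it is much slower than sending requests directly.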

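Getting the data from a third-party resolver API

I used the resolver at https://jx.chksz.top/, which I found more or less at random through a search engine. From a quick look it seems to be run by a middle-school student. Impressive for that age!
A bit of F12 later I had the API endpoint: https://api.kxzjoker.cn/api/163_music
04.png
05.png
Next comes the function that builds the request and downloads the files: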

def download_song_and_metadata(save_path:str,song_url: str,number_info:str, level: str = "jymaster") -> bool:
    # === 第一步:请求你的解析 API ===
    api_endpoint = "https://api.kxzjoker.cn/api/163_music"
    form_data = {
        "url": song_url,
        "level": level,
        "type": "json"
    }

    try:
        print("正在请求歌曲信息...")
        resp = requests.post(api_endpoint, data=form_data, timeout=15)
        resp.raise_for_status()
        data = resp.json()

        if data.get("status") != 200:
            print(f"API 返回错误状态: {data.get('status')}")
            return False

        # 提取字段
        title = sanitize_filename(data["name"])
        artist = data["ar_name"]
        album = sanitize_filename(data["al_name"])
        audio_url = data["url"]
        cover_url = data["pic"]
        lyric = data.get("lyric", "")
        tlyric = data.get("tlyric", "")

        # === 第二步:创建专辑目录 ===
        album_dir = os.path.join(save_path, album)
        os.makedirs(album_dir, exist_ok=True)

        # === 第三步:下载音频文件 ===
        audio_ext = ".flac" if ".flac" in audio_url else ".mp3"
        audio_path = os.path.join(album_dir, f"{number_info}.{title}{audio_ext}")

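        if not os.path.exists(audio_path):
            print(f"Downloading audio: {title}{audio_ext}")
            # Write to a temporary .part file first, so that an interrupted
            # download is not mistaken for a finished one on the next run
            tmp_path = audio_path + ".part"
            with requests.get(audio_url, stream=True, timeout=30) as audio_resp:
                audio_resp.raise_for_status()
                with open(tmp_path, "wb") as f:
                    for chunk in audio_resp.iter_content(chunk_size=8192):
                        f.write(chunk)
            os.replace(tmp_path, audio_path)
        else:
            print(f"Audio already exists, skipping: {audio_path}")

        # === Step 4: merge the lyrics and save them as .lrc ===
        lrc_path = os.path.join(album_dir, f"{number_info}.{title}.lrc")
        # Either field may be missing or null in the API response
        lyric = lyric or ""
        combined_lyric = merge_lyrics(lyric, tlyric) if tlyric else lyric
        with open(lrc_path, "w", encoding="utf-8") as f:
            f.write(combined_lyric)
        print(f"Lyrics saved: {lrc_path}")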
        print(f"歌词已保存: {lrc_path}")

        # === 第五步:下载封面为 folder.jpg ===
        cover_path = os.path.join(album_dir, "folder.jpg")
        if not os.path.exists(cover_path):
            print("正在下载专辑封面...")
            cover_resp = requests.get(cover_url, timeout=15)
            cover_resp.raise_for_status()
            with open(cover_path, "wb") as f:
                f.write(cover_resp.content)
            print(f"封面已保存: {cover_path}")
        else:
            print("封面已存在,跳过")

        print(f"\n✅ 下载完成!路径: {album_dir}")
        return True

    except Exception as e:
        print(f"❌ 下载失败: {e}")
        return False

def merge_lyrics(original: str, translation: str) -> str:
    """
    合并原文和翻译歌词,生成双语 LRC。
    假设时间戳一致,将翻译放在下一行。
    """
    orig_lines = original.strip().splitlines()
    trans_dict = {}

    # 解析翻译歌词的时间 -> 文本
    for line in translation.strip().splitlines():
        if "]:" in line or line.startswith("[by:"):
            continue
        parts = line.split("]", 1)
        if len(parts) == 2:
            time_tag, text = parts[0] + "]", parts[1].strip()
            if text:
                trans_dict[time_tag] = text

    # 合并
    merged = []
    for line in orig_lines:
        merged.append(line)
        if line.startswith("[") and "]" in line:
            time_tag = line.split("]", 1)[0] + "]"
            if time_tag in trans_dict:
                merged.append(time_tag + trans_dict[time_tag])

    return "\n".join(merged)
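To show what merging by timestamp produces, here is a compact, self-contained variant of the same idea with a tiny worked example. It is a sketch, not a drop-in replacement for merge_lyrics: the function name interleave_lrc, the regular expression, and the sample lyric lines are all made up for illustration, and it assumes timestamps of the form [mm:ss.xx].

```python
import re

# Matches a leading LRC timestamp such as [00:01.00] and captures the text after it
TIMESTAMP = re.compile(r"^(\[\d{1,2}:\d{2}(?:\.\d{1,3})?\])(.*)$")


def interleave_lrc(original: str, translation: str) -> str:
    """Put each translated line right after the original line with the same timestamp."""
    translated = {}
    for line in translation.splitlines():
        m = TIMESTAMP.match(line.strip())
        if m and m.group(2).strip():  # ignore metadata tags and empty lines
            translated[m.group(1)] = m.group(2).strip()

    merged = []
    for line in original.splitlines():
        merged.append(line)
        m = TIMESTAMP.match(line.strip())
        if m and m.group(1) in translated:
            merged.append(m.group(1) + translated[m.group(1)])
    return "\n".join(merged)


# Invented sample input
original = "[ti:demo]\n[00:01.00]first line\n[00:05.50]second line"
translation = "[by:someone]\n[00:01.00]translated first\n[00:05.50]"
print(interleave_lrc(original, translation))
```

Metadata tags such as [ti:...] and [by:...] do not match the timestamp pattern, so they pass through or get dropped without needing special-case checks.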

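Stitching it all together

With the components in place, what remains is stitching them together. Give the program a text file with one song URL per line and it downloads all the related files automatically and saves them in the expected layout.
Here is the final combined code: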

import os
import re

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_album_info(url):
    # 创建 Service 对象
    service = Service(executable_path='./geckodriver')
    # 创建 Options 对象
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # 替换为你的实际路径
    # 初始化 Firefox 驱动
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # 访问目标页面
        driver.get(url)
        # 等待iframe加载完成
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # 切换到iframe
        driver.switch_to.frame(iframe)
        # 等待iframe内部内容加载(等待10秒)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # 获取iframe中的完整HTML
        iframe_html = driver.page_source
        soup = BeautifulSoup(iframe_html, 'html.parser')
        song_dict = {}

        # 遍历所有歌曲行(tr元素)
        for tr in soup.find_all('tr'):
            # 提取页面序号(在<span class="num">中)
            num_span = tr.find('span', class_='num')
            if not num_span:
                continue

            try:
                page_num = int(num_span.get_text(strip=True))
            except ValueError:
                continue

            # 提取歌曲ID(在<a href="/song?id=XXX">中)
            a_tag = tr.find('a', href=lambda href: href and href.startswith('/song?id='))
            if not a_tag:
                continue

            href = a_tag['href']
            song_id = href.split('=')[1]
            song_dict[page_num] = song_id

        return song_dict
    except Exception as e:
        print(f"❌ 处理失败: {str(e)}")
    finally:
        driver.quit()

def sanitize_filename(filename: str) -> str:
    """移除文件名中的非法字符"""
    return re.sub(r'[<>:"/\\|?*\x00-\x1f]', '_', filename).strip()

def download_song_and_metadata(save_path:str,song_url: str,number_info:str, level: str = "jymaster") -> bool:
    # === 第一步:请求你的解析 API ===
    api_endpoint = "https://api.kxzjoker.cn/api/163_music"
    form_data = {
        "url": song_url,
        "level": level,
        "type": "json"
    }

    try:
        print("正在请求歌曲信息...")
        resp = requests.post(api_endpoint, data=form_data, timeout=15)
        resp.raise_for_status()
        data = resp.json()

        if data.get("status") != 200:
            print(f"API 返回错误状态: {data.get('status')}")
            return False

        # 提取字段
        title = sanitize_filename(data["name"])
        artist = data["ar_name"]
        album = sanitize_filename(data["al_name"])
        audio_url = data["url"]
        cover_url = data["pic"]
        lyric = data.get("lyric", "")
        tlyric = data.get("tlyric", "")

        # === 第二步:创建专辑目录 ===
        album_dir = os.path.join(save_path, album)
        os.makedirs(album_dir, exist_ok=True)

        # === 第三步:下载音频文件 ===
        audio_ext = ".flac" if ".flac" in audio_url else ".mp3"
        audio_path = os.path.join(album_dir, f"{number_info}.{title}{audio_ext}")

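        if not os.path.exists(audio_path):
            print(f"Downloading audio: {title}{audio_ext}")
            # Write to a temporary .part file first, so that an interrupted
            # download is not mistaken for a finished one on the next run
            tmp_path = audio_path + ".part"
            with requests.get(audio_url, stream=True, timeout=30) as audio_resp:
                audio_resp.raise_for_status()
                with open(tmp_path, "wb") as f:
                    for chunk in audio_resp.iter_content(chunk_size=8192):
                        f.write(chunk)
            os.replace(tmp_path, audio_path)
        else:
            print(f"Audio already exists, skipping: {audio_path}")

        # === Step 4: merge the lyrics and save them as .lrc ===
        lrc_path = os.path.join(album_dir, f"{number_info}.{title}.lrc")
        # Either field may be missing or null in the API response
        lyric = lyric or ""
        combined_lyric = merge_lyrics(lyric, tlyric) if tlyric else lyric
        with open(lrc_path, "w", encoding="utf-8") as f:
            f.write(combined_lyric)
        print(f"Lyrics saved: {lrc_path}")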
        print(f"歌词已保存: {lrc_path}")

        # === 第五步:下载封面为 folder.jpg ===
        cover_path = os.path.join(album_dir, "folder.jpg")
        if not os.path.exists(cover_path):
            print("正在下载专辑封面...")
            cover_resp = requests.get(cover_url, timeout=15)
            cover_resp.raise_for_status()
            with open(cover_path, "wb") as f:
                f.write(cover_resp.content)
            print(f"封面已保存: {cover_path}")
        else:
            print("封面已存在,跳过")

        print(f"\n✅ 下载完成!路径: {album_dir}")
        return True

    except Exception as e:
        print(f"❌ 下载失败: {e}")
        return False

def merge_lyrics(original: str, translation: str) -> str:
    """
    合并原文和翻译歌词,生成双语 LRC。
    假设时间戳一致,将翻译放在下一行。
    """
    orig_lines = original.strip().splitlines()
    trans_dict = {}

    # 解析翻译歌词的时间 -> 文本
    for line in translation.strip().splitlines():
        if "]:" in line or line.startswith("[by:"):
            continue
        parts = line.split("]", 1)
        if len(parts) == 2:
            time_tag, text = parts[0] + "]", parts[1].strip()
            if text:
                trans_dict[time_tag] = text

    # 合并
    merged = []
    for line in orig_lines:
        merged.append(line)
        if line.startswith("[") and "]" in line:
            time_tag = line.split("]", 1)[0] + "]"
            if time_tag in trans_dict:
                merged.append(time_tag + trans_dict[time_tag])

    return "\n".join(merged)

def extract_music_info(url):
    # 创建 Service 对象
    service = Service(executable_path='./geckodriver')
    # 创建 Options 对象
    options = Options()
    options.binary_location = "/opt/firefox-esr/firefox"  # 替换为你的实际路径
    # 初始化 Firefox 驱动
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # 访问目标页面
        driver.get(url)
        # 等待iframe加载完成
        iframe = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "g_iframe"))
        )
        # 切换到iframe
        driver.switch_to.frame(iframe)
        # 等待iframe内部内容加载(等待10秒)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # 获取iframe中的完整HTML
        html = driver.page_source
        # 提取艺术家(og:music:artist)
        artist_match = re.search(r'<meta property="og:music:artist" content="([^"]+)"', html)
        artist = artist_match.group(1) if artist_match else None

        # 提取专辑名称(og:music:album)
        album_name_match = re.search(r'<meta property="og:music:album" content="([^"]+)"', html)
        album_name = album_name_match.group(1) if album_name_match else None

        # 提取专辑ID(从music:album的URL中解析)
        album_url_match = re.search(r'<meta property="music:album" content="([^"]+)"', html)
        album_id = None
        if album_url_match:
            url_content = album_url_match.group(1)
            id_match = re.search(r'id=(\d+)', url_content)
            album_id = id_match.group(1) if id_match else None

        # 检查是否所有字段都提取成功
        if artist and album_name and album_id:
            return (artist, album_name, album_id)
        else:
            return None

    except Exception as e:
        print(f"❌ 处理失败: {str(e)}")
    finally:
        driver.quit()

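def download_album_from_song(url):
    info_result = extract_music_info(url)
    if not info_result:
        # Stop here; indexing into None below would raise a TypeError
        print("Extraction failed; check the URL or the page structure")
        return
    artist, album_name, album_id = info_result
    print(f"Artist: {artist}")
    print(f"Album: {album_name}")
    print(f"Album ID: {album_id}")

    # The artist name becomes a directory name, so sanitize it as well
    download_path = os.path.join(os.getcwd(), sanitize_filename(artist))
    songid_s = get_album_info("https://music.163.com/#/album?id=" + album_id)
    if not songid_s:
        print("No tracks found on the album page")
        return
    for number_info, songid_in in sorted(songid_s.items()):
        to_download_url = "https://music.163.com/song?id=" + songid_in
        # Zero-pad the track number so files are named 01.xxx, 02.xxx, ...
        download_song_and_metadata(download_path, to_download_url, f"{number_info:02d}", level="jymaster")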

def extract_songlist(filename: str):
    """
    主函数:读取指定文件并逐行处理。
    :param filename: 要读取的 txt 文件名(位于当前目录)
    """
    # 获取当前工作目录
    current_dir = os.getcwd()
    file_path = os.path.join(current_dir, filename)

    # 检查文件是否存在
    if not os.path.isfile(file_path):
        print(f"错误:文件 '{filename}' 不存在于当前目录 ({current_dir})")
        return

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line_number, line in enumerate(file, start=1):
                # Drop the trailing newline and surrounding whitespace
                url = line.strip()
                # Skip blank lines
                if not url:
                    continue
                try:
                    download_album_from_song(url)
                except Exception as e:
                    print(f"Error while processing line {line_number}: {e}")
    except Exception as e:
        print(f"读取文件时发生错误: {e}")

# 使用示例
if __name__ == "__main__":
    txt_filename = "collect_list.txt"
    extract_songlist(txt_filename)

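Once the required Python modules are installed, save collect_list.txt in the current directory, put the geckodriver binary there, check that the paths are correct, and run the script with Python. After a while the albums you wanted appear in the current directory.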
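Because every song URL triggers a full browser launch and an album download, it can help to tidy the list before feeding it to the script. The following is a rough sketch of such a pre-processing step; the helper name clean_song_urls is hypothetical and the final script above does not do this.

```python
def clean_song_urls(lines):
    """Strip whitespace, drop blank lines, and remove exact duplicates (order preserved)."""
    seen = set()
    urls = []
    for raw in lines:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Example with lines as they would come out of collect_list.txt
sample = [
    "https://music.163.com/#/song?id=1499823984\n",
    "\n",
    "https://music.163.com/#/song?id=1499823984\n",
    "  https://music.163.com/#/song?id=1499826814  \n",
]
print(clean_song_urls(sample))
```

With a real file this could be called as clean_song_urls(open("collect_list.txt", encoding="utf-8")). It only removes identical URLs; two different songs from the same album would still be processed twice.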

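Fixing the tags on the music files

The downloaded FLAC files carry no metadata, so it has to be filled in from the path and file name. At this point I also realized that the batch I scraped earlier had never been tagged properly, which is why Jellyfin still displayed parts of the library in a jumble.
The following Python program fixes that: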

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
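"""
Write the track number and title found in each file name into the audio metadata
(FLAC/MP3/M4A). Artist and album are taken from the Artist/Album directory names.
Usage:
  python3 write_title_and_track.py /home/chocola/data/files/Music --dry-run
  python3 write_title_and_track.py /home/chocola/data/files/Music --force
"""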
import os, sys, re, argparse
from pathlib import Path

# Import mutagen; if it is missing, install it for the current user and retry
try:
    from mutagen import File
    from mutagen.flac import FLAC
    from mutagen.id3 import ID3, TIT2, TRCK, ID3NoHeaderError
    from mutagen.mp4 import MP4, MP4Tags
except ImportError:
    print("mutagen not found; trying to install it (pip install --user mutagen)...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "mutagen"])
    from mutagen import File
    from mutagen.flac import FLAC
    from mutagen.id3 import ID3, TIT2, TRCK, ID3NoHeaderError
    from mutagen.mp4 import MP4, MP4Tags

AUDIO_EXTS = ('.flac', '.mp3', '.m4a', '.mp4', '.aac', '.ogg', '.opus')  # will skip unknown handlers

# regex to extract track number and title
# matches: "01. Title", "1 - Title", "01 Title", "01_ Title", "01.Title", leading zeros allowed
_leading_track_re = re.compile(r'^\s*(\d{1,3})\s*[\.\-_\s:]+\s*(.+)$')
# fallback: "01Title" (no separator)
_leading_track_nosep_re = re.compile(r'^\s*(\d{1,3})(.+)$')

# normalize fullwidth parentheses and brackets to ascii
def normalize_parens(s: str) -> str:
    return (s
            .replace('(', '(').replace(')', ')')
            .replace('[', '[').replace(']', ']')
            .replace('【', '[').replace('】', ']')
            )

def parse_filename_to_track_title(fn: str):
    name = fn
    # remove extension already handled by caller usually
    name = name.strip()
    name = normalize_parens(name)
    # try main regex
    m = _leading_track_re.match(name)
    if m:
        tr = m.group(1).lstrip('0') or '0'
        title = m.group(2).strip()
        return tr, title
    m2 = _leading_track_nosep_re.match(name)
    if m2:
        tr = m2.group(1).lstrip('0') or '0'
        title = m2.group(2).strip()
        # but if title is short (like only 1 char) and looks numeric, ignore
        return tr, title
    # No leading track detected: return (None, full name)
    return None, name

def write_flac(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    audio = FLAC(path)
    tags = audio.tags or {}
    changed = False
    # ensure artist/album set if available
    if artist and (not tags.get('artist') or tags.get('artist') != [artist]):
        if not dry_run:
            audio['artist'] = [artist]
        changed = True
    if album and (not tags.get('album') or tags.get('album') != [album]):
        if not dry_run:
            audio['album'] = [album]
        changed = True
    # title
    cur_title = tags.get('title', None)
    if (force or not cur_title) and title:
        if not dry_run:
            audio['title'] = [title]
        changed = True
    # tracknumber
    cur_track = tags.get('tracknumber', None)
    if tracknum and (force or not cur_track):
        if not dry_run:
            audio['tracknumber'] = [str(tracknum)]
        changed = True
    if changed and not dry_run:
        audio.save()
    return changed

def write_mp3(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    # TPE1/TALB are not part of the module-level import, so pull them in here
    from mutagen.id3 import TPE1, TALB
    try:
        audio = ID3(path)
    except ID3NoHeaderError:
        audio = ID3()  # no ID3 header yet; start from an empty tag
    changed = False
    # title (TIT2)
    if title and (force or not audio.getall('TIT2')):
        if not dry_run:
            audio.delall('TIT2')
            audio.add(TIT2(encoding=3, text=title))
        changed = True
    # track number (TRCK)
    if tracknum and (force or not audio.getall('TRCK')):
        if not dry_run:
            audio.delall('TRCK')
            audio.add(TRCK(encoding=3, text=str(tracknum)))
        changed = True
    # artist (TPE1)
    if artist and (force or not audio.getall('TPE1')):
        if not dry_run:
            audio.delall('TPE1')
            audio.add(TPE1(encoding=3, text=artist))
        changed = True
    # album (TALB)
    if album and (force or not audio.getall('TALB')):
        if not dry_run:
            audio.delall('TALB')
            audio.add(TALB(encoding=3, text=album))
        changed = True
    if changed and not dry_run:
        audio.save(path)
    return changed

def write_mp4(path: Path, artist: str, album: str, tracknum: str, title: str, dry_run=False, force=False):
    audio = MP4(path)
    tags = audio.tags or MP4Tags()
    changed = False
    # artist
    if artist and (tags.get('\xa9ART') != [artist]):
        if not dry_run:
            tags['\xa9ART'] = [artist]
        changed = True
    # album
    if album and (tags.get('\xa9alb') != [album]):
        if not dry_run:
            tags['\xa9alb'] = [album]
        changed = True
    # title
    if (force or not tags.get('\xa9nam')) and title:
        if not dry_run:
            tags['\xa9nam'] = [title]
        changed = True
    # track number (trkn) is a tuple (track, total)
    if tracknum and (force or not tags.get('trkn')):
        try:
            tr = int(tracknum)
        except Exception:
            tr = None
        if tr:
            if not dry_run:
                tags['trkn'] = [(tr, 0)]
            changed = True
    if changed and not dry_run:
        audio.tags = tags
        audio.save()
    return changed

def process_file(path: Path, root: Path, dry_run=False, force=False):
    # derive artist/album from path: root/.../Artist/Album/file
    rel = path.relative_to(root)
    parts = rel.parts
    if len(parts) < 3:
        # not deep enough; skip
        return False, "path_too_shallow"
    album = parts[-2]
    artist = parts[-3]
    stem = path.stem  # name without extension
    # Normalize Unicode around parentheses
    stem = normalize_parens(stem)
    tracknum, title = parse_filename_to_track_title(stem)
    # if no tracknum found, leave tracknum None and title entire stem
    if tracknum is None:
        # remove trailing bracketed translations, e.g. "SongName (中文翻译)" -> keep whole as title
        title = stem.strip()
    # prefer title if present
    # write depending on extension
    ext = path.suffix.lower()
    try:
        if ext == '.flac':
            changed = write_flac(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        elif ext == '.mp3':
            changed = write_mp3(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        elif ext in ('.m4a', '.mp4'):
            changed = write_mp4(path, artist, album, tracknum, title, dry_run=dry_run, force=force)
        else:
            # unsupported ext -> try generic mutagen write as fallback
            audio = File(path)
            if audio is None:
                return False, "unsupported_format"
            # best-effort: set common tags
            changed = False
            if getattr(audio, 'tags', None) is not None:  # tags is None for untagged files
                if force or not audio.tags.get('title'):
                    if not dry_run:
                        audio.tags['title'] = title
                    changed = True
                if force or not audio.tags.get('artist'):
                    if not dry_run:
                        audio.tags['artist'] = artist
                    changed = True
                if force or not audio.tags.get('album'):
                    if not dry_run:
                        audio.tags['album'] = album
                    changed = True
                if tracknum and (force or not audio.tags.get('tracknumber')):
                    if not dry_run:
                        audio.tags['tracknumber'] = str(tracknum)
                    changed = True
                if changed and not dry_run:
                    audio.save()
            return changed, "done"
        return changed, "done"
    except Exception as e:
        return False, f"error:{e}"

def main():
    p = argparse.ArgumentParser(description="从文件名写入 title 与 track 到音频元数据(支持 FLAC/MP3/M4A)")
    p.add_argument("root", help="音乐库根目录(会递归)")
    p.add_argument("--dry-run", action="store_true", help="仅预览,不写入")
    p.add_argument("--force", action="store_true", help="强制覆盖已有标签")
    p.add_argument("--limit", type=int, default=0, help="仅处理前 N 个文件(测试用)")
    args = p.parse_args()
    root = Path(args.root).resolve()
    if not root.is_dir():
        print("错误:root 不是目录:", root); sys.exit(1)
    total = 0
    changed_cnt = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for fn in filenames:
            if not fn.lower().endswith(AUDIO_EXTS):
                continue
            total += 1
            path = Path(dirpath) / fn
            changed, reason = process_file(path, root, dry_run=args.dry_run, force=args.force)
            if changed:
                changed_cnt += 1
                print(("DRY " if args.dry_run else "") + f"修改: {path} ({reason})")
            # optional limit
            if args.limit and total >= args.limit:
                break
        if args.limit and total >= args.limit:
            break
    print(f"完成: 总文件={total}, 修改={changed_cnt}")

if __name__ == "__main__":
    main()

Pass the library path to the script (saved here as tag_music.py) and run it, for example:

python tag_music.py /run/media/chocola/6T-DATA/Projects/Get_Information/Fuck_Netease/Music/

Note that the directory layout must match the one described earlier, that is Artist/Album/01.Title.flac.
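To see why the layout matters, here is a stripped-down sketch of how a relative path is turned into tags. It reuses the main track-number regular expression from the tagging script, but the function tags_from_path itself is written just for this illustration and handles only the simple "number, separator, title" case.

```python
import re
from pathlib import PurePosixPath

# Same pattern as _leading_track_re in the tagging script
TRACK_RE = re.compile(r'^\s*(\d{1,3})\s*[\.\-_\s:]+\s*(.+)$')


def tags_from_path(rel_path: str) -> dict:
    """Derive artist/album/track/title from an Artist/Album/NN.Title.ext path."""
    p = PurePosixPath(rel_path)
    m = TRACK_RE.match(p.stem)
    if m:
        track, title = m.group(1).lstrip('0') or '0', m.group(2).strip()
    else:
        track, title = None, p.stem  # no leading number: keep the whole stem as title
    return {
        "artist": p.parent.parent.name,
        "album": p.parent.name,
        "track": track,
        "title": title,
    }


print(tags_from_path("KOTOKO/ネコぱら vol.4 OP ⁄ ED/02.NEGAIGOTO.flac"))
```

The artist and album come purely from the two directory levels above the file, so a file stored one level too shallow or too deep would be tagged incorrectly or skipped.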
When it finishes, you have a media library of your own!
But is that really the end of the story?

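Exposing NetEase's fake-master scam

If you were paying attention, you noticed that I requested the highest quality level, jymaster, the "ultra-clear master" tier.
So what is a master, by definition?

A master is the final audio file produced at the end of the music production process, used for subsequent duplication, release, and streaming distribution. After mixing is complete, mastering fine-tunes the audio as a whole (equalization, compression, limiting, dynamics control, and so on) so that it sounds consistent across playback devices and meets industry standards for loudness and format. The master is both the source file for physical and digital releases and the final sonic form of the work, the key link between creator and listener. A high-quality master improves the overall clarity, sense of space, and polish of the music.

That sounds great, but in practice most released music files are ripped from CDs, so the quality ceiling we normally have access to is lossless CD audio at 16-bit/44.1 kHz. The vast majority of the so-called masters on NetEase are upsampled from CD-quality sources after the fact. Keeping these huge files does little for how the music sounds and just takes up more storage. I learned this from chatting with headphone enthusiasts in a group chat.
If you do not believe it, install sox and look at the spectrogram of a file.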

sox 1.SWEET×SWEET.flac -n spectrogram -o ./infox.png

06.png
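The energy is clearly concentrated in the 0–20 kHz band, shown in yellow-white, orange, and red.
The 20–90 kHz band is deep purple and blue, with almost no signal.
If you still have doubts, here is a higher-resolution spectrogram: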

sox "1.SWEET×SWEET.flac" -n spectrogram   -o "high_res_spectrogram.png"  -x 4000 -y 1200

07.png
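This is a clear trace of after-the-fact upsampling, whether done by hand or by an algorithmic/AI interpolation, so there is no need to keep the useless data; the files can safely be brought back down to lossless CD quality.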
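Some back-of-the-envelope arithmetic shows how much raw data is at stake. The sketch below compares uncompressed PCM data rates; the 24-bit/192 kHz figures are an assumption chosen because a spectrogram reaching roughly 90 kHz implies a sample rate of about 192 kHz, and the real files are FLAC-compressed, so actual sizes will be smaller than these raw numbers.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 2) -> float:
    """Raw (uncompressed) PCM data rate in kbit/s."""
    return sample_rate_hz * bit_depth * channels / 1000


def nyquist_khz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent, in kHz."""
    return sample_rate_hz / 2 / 1000


cd = pcm_bitrate_kbps(44_100, 16)       # CD quality, from the article
hires = pcm_bitrate_kbps(192_000, 24)   # assumed "master" format

print(f"CD:     {cd:.1f} kbit/s, content up to {nyquist_khz(44_100):.2f} kHz")
print(f"Hi-res: {hires:.1f} kbit/s, content up to {nyquist_khz(192_000):.2f} kHz")
print(f"Raw size ratio: {hires / cd:.2f}x")
```

Under these assumptions the raw stream is about 6.5 times larger than CD audio, while the spectrogram shows that nearly all of the extra bandwidth above 20 kHz is empty.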

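Converting the files with a script to save space

These media files really do take up a lot of room: 160 GB on my NAS. I asked an AI, which said at least 60%-70% of that could be saved, so I decided to clean up the library.
Save the code below as convert.py and remember to change the original and new media library paths.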

#!/usr/bin/env python3
import os
import shutil
import subprocess
from pathlib import Path
from mutagen import File as MutagenFile

# ================== Configuration (edit these paths) ==================
SOURCE = Path("/home/chocola/data/files/Music/")  # original media library
TARGET = Path("/mnt/4T-RAID1/data/music_01/")  # new media library
# ======================================================================

# pathlib normalizes trailing slashes itself, so no manual fix-up is needed

# 验证路径
if not SOURCE.exists():
    raise FileNotFoundError(f"源路径不存在: {SOURCE}")
TARGET.mkdir(parents=True, exist_ok=True)

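def get_sample_rate(file_path):
    """Return the sample rate in Hz (mutagen first, mediainfo as fallback); 0 if unknown."""
    # Try mutagen first
    try:
        audio = MutagenFile(file_path)
        # Compare against None: `'audio' in audio` would look up a tag named
        # "audio", which never exists, so the mutagen path was never taken
        if audio is not None and getattr(audio.info, 'sample_rate', 0):
            return int(audio.info.sample_rate)
    except Exception:
        pass  # mutagen failed; fall back to mediainfo

    # Fall back to the mediainfo CLI
    try:
        result = subprocess.run(
            ['mediainfo', '--Inform=Audio;%SamplingRate%', str(file_path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True
        )
        rate_str = result.stdout.decode().strip()
        # Usually plain Hz such as "48000"; also accept a form like "48.0 kHz"
        if 'kHz' in rate_str:
            return int(round(float(rate_str.replace('kHz', '').strip()) * 1000))
        return int(float(rate_str))
    except Exception as e:
        print(f"⚠ Could not read sample rate (file: {file_path}, error: {e})")
        return 0  # 0 means the sample rate is unknown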

def is_high_resolution_flac(file_path):
    """检查FLAC是否为高采样率 (>44.1kHz)"""
    sample_rate = get_sample_rate(file_path)
    if sample_rate <= 0:
        print(f"⚠ 跳过文件 (无效采样率): {file_path}")
        return False
    print(f"  [采样率] {file_path}: {sample_rate} Hz")  # 调试日志
    return sample_rate > 44100  # >44.1kHz

def convert_flac(file_path):
    """安全转换FLAC文件(100%避免ffmpeg交互)"""
    rel_path = file_path.relative_to(SOURCE)
    target_path = TARGET / rel_path
    
    # 确保目标目录存在
    target_path.parent.mkdir(parents=True, exist_ok=True)
    
    # 用mutagen安全提取元数据
    audio = MutagenFile(file_path)
    title = audio.get('title', [os.path.splitext(file_path.name)[0]])[0]
    artist = audio.get('artist', [''])[0]
    album = audio.get('album', [''])[0]
    
    # 使用ffmpeg安全转码
    cmd = [
        'ffmpeg',
        '-y',
        '-i', str(file_path),
        '-map', '0:a:0',
        '-c:a', 'flac',
        '-sample_fmt', 's16',
        '-ar', '44100',
        '-metadata', f'title={title}',
        '-metadata', f'artist={artist}',
        '-metadata', f'album={album}',
        str(target_path)
    ]
    
    try:
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        )
        print(f"✅ 转码完成: {rel_path} (原采样率: {get_sample_rate(file_path)}Hz)")
    except subprocess.CalledProcessError as e:
        print(f"❌ 转码失败 (回退到原文件): {file_path}")
        shutil.copy2(file_path, target_path)

def copy_file(file_path):
    """安全复制非FLAC文件"""
    rel_path = file_path.relative_to(SOURCE)
    target_path = TARGET / rel_path
    target_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(file_path, target_path)
    print(f"✅ 复制: {rel_path}")

def main():
    print(f"开始处理媒体库: {SOURCE} → {TARGET}")
    print("⚠ 正在处理特殊字符文件名 (日语/中文)...")
    
    for root, _, files in os.walk(SOURCE):
        for file in files:
            file_path = Path(root) / file
            if file.startswith('.'):  # 跳过隐藏文件
                continue
                
            if file.lower().endswith('.flac'):
                if is_high_resolution_flac(file_path):
                    convert_flac(file_path)
                else:
                    copy_file(file_path)
            elif file.lower().endswith(('.mp3', '.lrc', '.jpg', '.png')):
                copy_file(file_path)
    
    print("\n🎉 操作完成!新库已生成到:", TARGET)
    print("   - 伪高解析FLAC (48kHz+/96kHz/192kHz) 已转码为CD质量")
    print("   - CD质量FLAC/MP3/辅助文件已原样复制")

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"❌ 紧急错误: {e}")
        exit(1)

When it finished, the library had shrunk from 160 GB to 49 GB. Instant relief.
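As a quick sanity check, the measured result can be compared with the 60%-70% estimate mentioned earlier. This is plain arithmetic on the two sizes given above; the function name percent_saved is only for illustration.

```python
def percent_saved(before_gb: float, after_gb: float) -> float:
    """Share of the original size that was freed, in percent."""
    return (before_gb - after_gb) / before_gb * 100


saving = percent_saved(160, 49)
print(f"{saving:.1f}% saved")  # 69.4% saved
```

So the actual saving lands at the upper end of the estimated range.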

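That said, today I tried logging in to NetEase Cloud Music to see what was going on. It told me my account might be on a phone number that the carrier had reassigned; after I confirmed the account was mine, it turned out to be unlocked. Before that it had been stuck demanding face-scan real-name verification, which I refused to do, so in a fit of anger I built my own media library, and before I knew it a year had passed. I will not be going back to NetEase anyway. Maintaining my own library is enough, and it spares me the rage of finding yet another chunk of my playlist turned into VIP-only tracks.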