Add support for GPU (MPS and CUDA)
Migrate to `uv`
.gitignore (vendored) | 5

@@ -1,6 +1,9 @@
+audio_summary_with_local_LLM.egg-info/
+.ruff_cache/
 # Virtual Env
 .venv
 
 # Local data
 .DS_Store
 tmp
+summary.md
README.md | 184

@@ -1,6 +1,6 @@
-# Audio Summary with local LLM
+# Audio Summary with Local LLM
 
-This tool is designed to provide a quick and concise summary of audio and video files. It supports summarizing content either from a local file or directly from YouTube. The tool uses Whisper for transcription and a local version of Mistral AI (Ollama) for generating summaries.
+This tool is designed to provide a quick and concise summary of audio and video files. It supports summarizing content either from a local file or directly from YouTube. The tool uses Whisper for transcription and a local version of Llama3 (via Ollama) for generating summaries.
 
 > [!TIP]
 > It is possible to change the model you wish to use.
@@ -9,45 +9,47 @@ This tool is designed to provide a quick and concise summary of audio and video
 ## Features
 
 - **YouTube Integration**: Download and summarize content directly from YouTube.
-- **Local File Support**: Summarize audio files available on your local disk.
+- **Local File Support**: Summarize audio/video files available on your local disk.
 - **Transcription**: Converts audio content to text using Whisper.
-- **Summarization**: Generates a concise summary using Mistral AI (Ollama).
+- **Summarization**: Generates a concise summary using Llama3 (Ollama).
 - **Transcript Only Option**: Option to only transcribe the audio content without generating a summary.
+- **Device Optimization**: Automatically uses the best available hardware (MPS for Mac, CUDA for NVIDIA GPUs, or CPU).
 
 ## Prerequisites
 
 Before you start using this tool, you need to install the following dependencies:
 
-- Python 3.8 or higher
-- `pytube` for downloading videos from YouTube.
-- `pathlib` for local file handling
-- `openai-whisper` for audio transcription.
-- [Ollama](https://ollama.com) for LLM model management.
-- `ffmpeg` (required for whisper)
+- Python 3.12 and lower than 3.13
+- [Ollama](https://ollama.com) for LLM model management
+- `ffmpeg` (required for audio processing)
+- [uv](https://docs.astral.sh/uv/getting-started/installation/) for package management
 
 ## Installation
 
-### Python Requirements
+### Using uv
 
-Clone the repository and install the required Python packages:
+Clone the repository and install the required Python packages using [uv](https://github.com/astral-sh/uv):
 
 ```bash
 git clone https://github.com/damienarnodo/audio-summary-with-local-LLM.git
 cd audio-summary-with-local-LLM
-pip install -r src/requirements.txt
+
+# Create and activate a virtual environment with uv
+uv sync
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
 ```
 
 ### LLM Requirement
 
 [Download and install](https://ollama.com) Ollama to carry out LLM Management. More details about LLM models supported can be found on the Ollama [GitHub](https://github.com/ollama/ollama).
 
-Download and use the Mistral model:
+Download and use the Llama3 model:
 
 ```bash
-ollama pull mistral
+ollama pull llama3
 
 ## Test the access:
-ollama run mistral "tell me a joke"
+ollama run llama3 "tell me a joke"
 ```
 
 ## Usage
@@ -56,51 +58,175 @@ The tool can be executed with the following command line options:
 
 - `--from-youtube`: To download and summarize a video from YouTube.
 - `--from-local`: To load and summarize an audio or video file from the local disk.
-- `--transcript-only`: To only transcribe the audio content without generating a summary. This option must be used with either `--from-youtube` or `--from-local`.
+- `--output`: Specify the output file path (default: ./summary.md)
+- `--transcript-only`: To only transcribe the audio content without generating a summary.
+- `--language`: Select the language to be used for the transcription (default: en)
 
 ### Examples
 
 1. **Summarizing a YouTube video:**
 
 ```bash
-python src/summary.py --from-youtube <YouTube-Video-URL>
+uv run python src/summary.py --from-youtube <YouTube-Video-URL>
 ```
 
 2. **Summarizing a local audio file:**
 
 ```bash
-python src/summary.py --from-local <path-to-audio-file>
+uv run python src/summary.py --from-local <path-to-audio-file>
 ```
 
 3. **Transcribing a YouTube video without summarizing:**
 
 ```bash
-python src/summary.py --from-youtube <YouTube-Video-URL> --transcript-only
+uv run python src/summary.py --from-youtube <YouTube-Video-URL> --transcript-only
 ```
 
 4. **Transcribing a local audio file without summarizing:**
 
 ```bash
-python src/summary.py --from-local <path-to-audio-file> --transcript-only
+uv run python src/summary.py --from-local <path-to-audio-file> --transcript-only
+```
+
+5. **Specifying a custom output file:**
+
+```bash
+uv run python src/summary.py --from-youtube <YouTube-Video-URL> --output my_summary.md
 ```
 
 The output summary will be saved in a markdown file in the specified output directory, while the transcript will be saved in the temporary directory.
 
 ## Output
 
-The summarized content is saved as a markdown file named `summary.md` in the current working directory. This file includes the transcribed text and its corresponding summary. If `--transcript-only` is used, only the transcription will be saved in the temporary directory.
+The summarized content is saved as a markdown file (default: `summary.md`) in the current working directory. This file includes a title and a concise summary of the content. The transcript is saved in the `tmp/transcript.txt` file.
 
+## Hardware Acceleration
+
+The tool automatically detects and uses the best available hardware:
+
+- MPS (Metal Performance Shaders) for Apple Silicon Macs
+- CUDA for NVIDIA GPUs
+- Falls back to CPU when neither is available
+
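As an aside (not part of the commit itself), the selection order described above matches the `get_device()` helper added to `src/summary.py` further down in this diff; a self-contained sketch, assuming PyTorch and Transformers are installed, looks like this:

```python
import torch
from transformers import pipeline

# Prefer Apple's Metal backend, then CUDA, then fall back to the CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# The device string is handed straight to the Hugging Face ASR pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=device,
)
```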
+### Handling Longer Audio Files
+
+This tool can process audio files of any length. For files longer than 30 seconds, the script automatically:
+
+1. Chunks the audio into manageable segments
+2. Processes each chunk separately
+3. Combines the results into a single transcript
+
+This approach allows for efficient processing of longer content while managing memory usage. However, be aware that:
+
+- Longer files will take proportionally more time to process
+- Very long files (>30 minutes) may require significant processing time, especially on CPU
+- For extremely long content, consider splitting the audio file into smaller segments before processing
+
+If you encounter memory issues with very long files, you can try:
+
+1. Using a smaller Whisper model by changing `WHISPER_MODEL` to "openai/whisper-base"
+2. Reducing the `chunk_length_s` parameter in the `transcribe_file` function
+3. Processing the file in separate parts and combining the summaries afterward
+
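As an illustration (not part of the commit), the chunked flow described above corresponds to the pipeline options the commit sets in `transcribe_file`; a minimal sketch, with a placeholder input path, is:

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,       # split audio longer than 30 seconds into chunks
    return_timestamps=True,  # needed so long inputs come back as chunks
)

result = transcriber("tmp/audio.mp3")  # placeholder path

# Long inputs return timestamped chunks; join them into one transcript.
if isinstance(result, dict) and "chunks" in result:
    transcript = " ".join(chunk["text"] for chunk in result["chunks"])
else:
    transcript = result["text"]
print(transcript)
```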
 ## Sources
 
 - [YouTube Video Summarizer with OpenAI Whisper and GPT](https://github.com/mirabdullahyaser/Summarizing-Youtube-Videos-with-OpenAI-Whisper-and-GPT-3/tree/master)
-- [Mistral Python Client](https://github.com/mistralai/client-python)
-- [Ollama : Installez LLama 2 et Code LLama en quelques secondes !](https://www.geeek.org/tutoriel-installation-llama-2-et-code-llama/)
+- [Ollama GitHub Repository](https://github.com/ollama/ollama)
+- [Transformers by Hugging Face](https://huggingface.co/docs/transformers/index)
+- [yt-dlp Documentation](https://github.com/yt-dlp/yt-dlp)
 
-## Known Issues
+## Troubleshooting
 
-```python
-ValueError: Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote URL, ensure that the URL is the full address to **download** the audio file.
+### ffmpeg not found
+
+If you encounter this error:
+
+```bash
+yt_dlp.utils.DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location
 ```
 
-To fix it :
-`ffmpeg -i my_file.mp4 -movflags faststart my_file_fixed.mp4`
+Please refer to [this post](https://www.reddit.com/r/StacherIO/wiki/ffmpeg/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
+
+### Audio Format Issues
+
+If you encounter this error:
+
+```bash
+ValueError: Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted.
+```
+
+Try converting your file with ffmpeg:
+
+```bash
+ffmpeg -i my_file.mp4 -movflags faststart my_file_fixed.mp4
+```
+
+### Memory Issues on CPU
+
+If you're running on CPU and encounter memory issues during transcription, consider:
+
+1. Using a smaller Whisper model
+2. Processing shorter audio segments
+3. Ensuring you have sufficient RAM available
+
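As a sketch only (not in the commit), tips 1 and 2 above could be combined like this, with a hypothetical smaller checkpoint and a shorter chunk length:

```python
from transformers import pipeline

# Hypothetical low-memory settings: smaller Whisper checkpoint, shorter chunks.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",  # smaller than the default whisper-large-v2
    device="cpu",
    chunk_length_s=15,            # shorter chunks keep peak memory lower
    return_timestamps=True,
)
print(transcriber("tmp/audio.mp3")["text"])  # placeholder path
```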
+### Slow Transcription
+
+Transcription can be slow on CPU. For best performance:
+
+1. Use a machine with GPU or Apple Silicon (MPS)
+2. Keep audio files under 10 minutes when possible
+3. Close other resource-intensive applications
+
+### Update the Whisper or LLM Model
+
+You can easily change the models used for transcription and summarization by modifying the variables at the top of the script:
+
+```python
+# Default models
+OLLAMA_MODEL = "llama3"
+WHISPER_MODEL = "openai/whisper-large-v2"
+```
+
+#### Changing the Whisper Model
+
+To use a different Whisper model for transcription:
+
+1. Update the `WHISPER_MODEL` variable with one of these options:
+   - "openai/whisper-tiny" (fastest, least accurate)
+   - "openai/whisper-base" (faster, less accurate)
+   - "openai/whisper-small" (balanced)
+   - "openai/whisper-medium" (slower, more accurate)
+   - "openai/whisper-large-v2" (slowest, most accurate)
+
+2. Example:
+
+```python
+WHISPER_MODEL = "openai/whisper-medium"  # A good balance between speed and accuracy
+```
+
+For CPU-only systems, using a smaller model like `whisper-base` is recommended for better performance.
+
+#### Changing the LLM Model
+
+To use a different model for summarization:
+
+1. First, pull the desired model with Ollama:
+
+```bash
+ollama pull mistral  # or any other supported model
+```
+
+2. Then update the `OLLAMA_MODEL` variable:
+
+```python
+OLLAMA_MODEL = "mistral"  # or any other model you've pulled
+```
+
+3. Popular alternatives include:
+   - "llama3" (default)
+   - "mistral"
+   - "llama2"
+   - "gemma:7b"
+   - "phi"
+
+For a complete list of available models, visit the [Ollama model library](https://ollama.com/library).
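For context (not shown in this hunk), the summary step goes through the `ollama` Python client, so changing `OLLAMA_MODEL` only changes the model name passed to it. A rough sketch, assuming the client's `chat` API and shortened prompts rather than the project's exact `summarize_text`:

```python
import ollama  # requires a running Ollama server with the model already pulled

response = ollama.chat(
    model="mistral",  # or "llama3", "gemma:7b", ... anything pulled via `ollama pull`
    messages=[
        {"role": "system", "content": "I would like for you to assume the role of a Technical Expert"},
        {"role": "user", "content": "Generate a concise summary of the text below.\nText : <transcript here>"},
    ],
)
print(response["message"]["content"])
```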
pyproject.toml (new file) | 109

@@ -0,0 +1,109 @@
+[project]
+name = "audio-summary-with-local-LLM"
+dynamic = ["version"]
+description = 'Sum up your local or remote files with a local LLM'
+keywords = ["audio", "summary", "local-llm", "ollama", "whisper"]
+readme = "README.md"
+requires-python = ">=3.12, <3.13"
+authors = [
+    { name = "darnodo", email = "sepales.pret0h@icloud.com" },
+]
+dependencies = [
+    "ffmpeg>=1.4",
+    "ollama>=0.4.7",
+    "openai-whisper>=20240930",
+    "torch>=2.6.0",
+    "torchaudio>=2.6.0",
+    "torchvision>=0.21.0",
+    "transformers>=4.50.2",
+    "yt-dlp>=2025.3.27",
+]
+[tool.setuptools]
+py-modules = []
+
+[tool.ruff]
+# Exclude a variety of commonly ignored directories.
+exclude = [
+    ".bzr",
+    ".direnv",
+    ".eggs",
+    ".git",
+    ".git-rewrite",
+    ".hg",
+    ".ipynb_checkpoints",
+    ".mypy_cache",
+    ".nox",
+    ".pants.d",
+    ".pyenv",
+    ".pytest_cache",
+    ".pytype",
+    ".ruff_cache",
+    ".svn",
+    ".tox",
+    ".venv",
+    ".vscode",
+    "__pypackages__",
+    "_build",
+    "buck-out",
+    "build",
+    "dist",
+    "node_modules",
+    "site-packages",
+    "venv",
+]
+
+# Same as Black.
+line-length = 88
+indent-width = 4
+
+# Assume Python 3.8
+target-version = "py38"
+
+[tool.ruff.lint]
+# Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
+# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
+# McCabe complexity (`C901`) by default.
+select = ["E4", "E7", "E9", "F"]
+ignore = []
+
+# Allow fix for all enabled rules (when `--fix`) is provided.
+fixable = ["ALL"]
+unfixable = []
+
+# Allow unused variables when underscore-prefixed.
+dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
+
+[tool.ruff.format]
+# Like Black, use double quotes for strings.
+quote-style = "double"
+
+# Like Black, indent with spaces, rather than tabs.
+indent-style = "space"
+
+# Like Black, respect magic trailing commas.
+skip-magic-trailing-comma = false
+
+# Like Black, automatically detect the appropriate line ending.
+line-ending = "auto"
+
+# Enable auto-formatting of code examples in docstrings. Markdown,
+# reStructuredText code/literal blocks and doctests are all supported.
+#
+# This is currently disabled by default, but it is planned for this
+# to be opt-out in the future.
+docstring-code-format = false
+
+# Set the line length limit used when formatting code snippets in
+# docstrings.
+#
+# This only has an effect when the `docstring-code-format` setting is
+# enabled.
+docstring-code-line-length = "dynamic"
+
+[dependency-groups]
+lint = [
+    "ruff>=0.0.17",
+]
+dev = [
+    "ipython>=5.10.0",
+]
src/requirements.txt (deleted)

@@ -1,6 +0,0 @@
-openai-whisper==20231117
-ollama==0.1.8
-torch==2.5.0.dev20240712
-torchaudio==2.4.0.dev20240712
-torchvision==0.20.0.dev20240712
-transformers==4.42.4
src/summary.py

@@ -3,8 +3,11 @@ import argparse
 from pathlib import Path
 from transformers import pipeline
 import yt_dlp
+import torch
 
 OLLAMA_MODEL = "llama3"
+WHISPER_MODEL = "openai/whisper-large-v2"
+WHISPER_LANGUAGE = "en"  # Set to desired language or None for auto-detection
 
 # Function to download a video from YouTube using yt-dlp
 def download_from_youtube(url: str, path: str):
@@ -20,26 +23,70 @@ def download_from_youtube(url: str, path: str):
     with yt_dlp.YoutubeDL(ydl_opts) as ydl:
         ydl.download([url])
 
+# Function to get the best available device
+def get_device():
+    if torch.backends.mps.is_available():
+        return "mps"
+    elif torch.cuda.is_available():
+        return "cuda"
+    else:
+        return "cpu"
+
 # Function to transcribe an audio file using the transformers pipeline
-def transcribe_file(file_path: str, output_file: str) -> str:
-    # Load the pipeline model for automatic speech recognition with MPS
-    transcriber_gpu = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", device="mps")
+def transcribe_file(file_path: str, output_file: str, language: str = None) -> str:
+    # Get the best available device
+    device = get_device()
+    print(f"Using device: {device} for transcription")
+
+    # Load the pipeline model for automatic speech recognition
+    transcriber = pipeline(
+        "automatic-speech-recognition",
+        model=WHISPER_MODEL,
+        device=device,
+        chunk_length_s=30,  # Process in 30-second chunks
+        return_timestamps=True  # Enable timestamp generation for longer audio
+    )
 
     # Transcribe the audio file
-    transcribe = transcriber_gpu(file_path)
+    # For CPU, we might want to use a smaller model or chunk the audio if memory is an issue
+    if device == "cpu":
+        print("Warning: Using CPU for transcription. This may be slow.")
+
+    # Set up generation keyword arguments including language
+    generate_kwargs = {}
+    if language and language.lower() != "auto":
+        generate_kwargs["language"] = language
+        print(f"Transcribing in language: {language}")
+    else:
+        print("Using automatic language detection")
+
+    # Transcribe the audio file
+    print("Starting transcription (this may take a while for longer files)...")
+    transcribe = transcriber(file_path, generate_kwargs=generate_kwargs)
+
+    # Extract the full text from the chunked transcription
+    if isinstance(transcribe, dict) and "text" in transcribe:
+        # Simple case - just one chunk
+        full_text = transcribe["text"]
+    elif isinstance(transcribe, dict) and "chunks" in transcribe:
+        # Multiple chunks with timestamps
+        full_text = " ".join([chunk["text"] for chunk in transcribe["chunks"]])
+    else:
+        # Fallback for other return formats
+        full_text = transcribe["text"] if "text" in transcribe else str(transcribe)
+
     # Save the transcribed text to the specified temporary file
     with open(output_file, 'w') as tmp_file:
-        tmp_file.write(transcribe["text"])
+        tmp_file.write(full_text)
     print(f"Transcription saved to file: {output_file}")
 
     # Return the transcribed text
-    return transcribe["text"]
+    return full_text
 
 # Function to summarize a text using the Ollama model
 def summarize_text(text: str, output_path: str) -> str:
     # Define the system prompt for the Ollama model
-    system_prompt = f"I would like for you to assume the role of a Technical Expert"
+    system_prompt = "I would like for you to assume the role of a Technical Expert"
     # Define the user prompt for the Ollama model
     user_prompt = f"""Generate a concise summary of the text below.
     Text : {text}
@@ -73,9 +120,15 @@ def main():
     group.add_argument("--from-local", type=str, help="Path to the local audio file.")
     parser.add_argument("--output", type=str, default="./summary.md", help="Output markdown file path.")
     parser.add_argument("--transcript-only", action='store_true', help="Only transcribe the file, do not summarize.")
+    parser.add_argument("--language", type=str, help="Language code for transcription (e.g., 'en', 'fr', 'es', or 'auto' for detection)")
 
     args = parser.parse_args()
 
+    # Determine language setting
+    language = args.language if args.language else WHISPER_LANGUAGE
+    if language and language.lower() == "auto":
+        language = None  # None triggers automatic language detection
+
     # Set up data directory
     data_directory = Path("tmp")
     # Check if the directory exists, if not, create it
@@ -94,7 +147,7 @@ def main():
 
     print(f"Transcribing file: {file_path}")
     # Transcribe the audio file
-    transcript = transcribe_file(str(file_path), data_directory / "transcript.txt")
+    transcript = transcribe_file(str(file_path), data_directory / "transcript.txt", language)
 
     if args.transcript_only:
         print("Transcription complete. Skipping summary generation.")
@@ -111,4 +164,4 @@ def main():
     print(f"Summary written to {args.output}")
 
 if __name__ == "__main__":
     main()