In the ever-evolving landscape of content creation, efficiency is key. Recently at the OpenAI Applications Explorers Meetup, Mike Pehel, a marketing consultant specializing in the drone industry and a contractor for the Linux Foundation, introduced an innovative approach to streamline the process of transforming YouTube videos into well-structured, informative blog posts. This method leverages the power of artificial intelligence, specifically OpenAI's GPT-4, to automate what was once a labor-intensive task. Fun fact: we use this tool to create the blog posts here on riis.com.
As always, you can read along here or follow along with the video version of the meetup.
The Old Way vs. The New Way
Traditionally, creating a blog post from a YouTube video involved a time-consuming process. To turn a video into an article, content creators would produce the video, transcribe the content, review the GitHub repository, and re-watch the video at increased speed to capture all the necessary information. This method was not only inefficient but also redundant, as all the required content already existed in various forms.
The new approach simplifies this process dramatically. By utilizing a combination of APIs and AI tools, content creators can now generate a comprehensive blog post with minimal manual input. This method involves:
Inputting basic information into a form
Sending the data to various APIs
Using AI to format and generate a well-structured output
The result is a foundational draft that can be quickly refined into a polished blog post.
Overcoming AI Hallucinations
One of the primary challenges in using AI for content generation is the occurrence of hallucinations - instances where the AI produces confident but incorrect or nonsensical information. Here’s a helpful metaphor:
Imagine compressing every word on the internet into a 2D grid. When the AI needs to predict the next word in a phrase, it’s essentially finding the closest vector on this grid. This simplification helps explain why AI might sometimes produce unexpected or incorrect responses.
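To make that metaphor a little more concrete, here is a toy illustration (not from the talk) of picking the "closest" word to a query vector with cosine similarity over a few hand-made 2D embeddings; real models use thousands of dimensions, but the nearest-vector intuition is the same:
import math

# Toy 2D "embeddings": hypothetical values purely for illustration
word_vectors = {
    "drone": (0.9, 0.8),
    "quadcopter": (0.85, 0.75),
    "recipe": (-0.7, 0.2),
}

def cosine_similarity(a, b):
    dot = a[0] * b[0] + a[1] * b[1]
    norm_a = math.sqrt(a[0] ** 2 + a[1] ** 2)
    norm_b = math.sqrt(b[0] ** 2 + b[1] ** 2)
    return dot / (norm_a * norm_b)

def closest_word(query_vector):
    # The "prediction" is whichever stored vector points in the most similar direction
    return max(word_vectors, key=lambda w: cosine_similarity(word_vectors[w], query_vector))

print(closest_word((0.88, 0.79)))  # prints "drone"; "quadcopter" is a near miss, which is the point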
To mitigate hallucinations, the new approach involves:
Using highly detailed prompts
Creating multiple guideposts from our original content within the prompt to constrain the AI’s responses
The Technical Implementation
The core of this system is built using Flask, a lightweight web application framework in Python. Below is an overview of the directory structure. Feel free to make these files ahead of time and start filling them in as we go through the tutorial.
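The layout here is reconstructed from the files we create throughout the tutorial (plus Flask's default templates folder), so treat it as a guide rather than gospel:
project/
├── run.py
├── config.py
├── requirements.txt
└── app/
    ├── __init__.py
    ├── routes.py
    ├── templates/
    │   ├── index.html
    │   └── article.html
    └── utils/
        ├── youtube_retriever.py
        ├── github_analyzer.py
        ├── file_parser.py
        ├── article_generator.py
        └── article_checker.py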
The first thing to do is fill your requirements.txt file with the following:
anthropic==0.34.1
Flask==1.1.2
GitPython==3.1.43
langchain==0.2.14
markdown2==2.5.0
openai==1.42.0
PyGithub==2.3.0
youtube_transcript_api==0.6.2
Then issue the command pip install -r requirements.txt.
Flask Application Structure
The application follows a typical Flask structure:
--- run.py ---
from app import create_app
app = create_app()
if __name__ == '__main__':
    app.run(debug=True)
--- __init__.py ---
from flask import Flask
from config import Config
def create_app():
    app = Flask(__name__)
    app.config.from_object(Config)

    from app import routes
    app.register_blueprint(routes.main)

    return app
This setup creates a modular application, making it easier to manage different components and scale the project as needed.
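One file the snippets above depend on but never show is config.py, which supplies the Config object imported in __init__.py. A minimal sketch, assuming you keep secrets in environment variables (the utility modules below read OPENAI_API_KEY and GITHUB_TOKEN straight from os.environ), could look like this:
--- config.py ---
import os

class Config:
    # Flask uses this for session signing; any random string is fine for local use
    SECRET_KEY = os.environ.get('SECRET_KEY', 'dev-secret-change-me')
    # Optional: expose the API keys on the config so missing ones are obvious at startup
    OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
    GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN')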
Handling User Input
Flask uses what it calls templates for its pages. This structure allows us to pass information from our API calls into a fixed-format HTML file.
Our first file is index.html. The application uses a simple HTML form to collect user input:
--- index.html ---
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Article Generator</title>
</head>
<body>
<h1>Article Generator</h1>
<form action="/" method="post" enctype="multipart/form-data">
<div id="speakersContainer">
<div class="speaker">
<h3>Speaker</h3>
<label for="speaker_name">Speaker Name:</label>
<input type="text" id="speaker_name" name="speaker_name" required><br><br>
<label for="speaker_bio">Speaker Bio:</label>
<textarea id="speaker_bio" name="speaker_bio"></textarea><br><br>
</div>
</div>
<label for="video_title">Video Title:</label>
<input type="text" id="video_title" name="video_title" required><br><br>
<label for="video_description">Video Description:</label>
<textarea id="video_description" name="video_description"></textarea><br><br>
<label for="youtube_url">YouTube URL:</label>
<input type="text" id="youtube_url" name="youtube_url"><br><br>
<label for="github_url">GitHub Repository URL:</label>
<input type="url" id="github_url" name="github_url"><br><br>
<input type="submit" value="Generate Article">
</form>
</body>
</html>
This form collects essential information such as speaker details, video title and description, YouTube URL, and GitHub repository URL.
We also need to make our article.html file for our generated article.
--- article.html ---
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Generated Article</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
padding: 20px;
max-width: 800px;
margin: 0 auto;
}
h1 {
color: #333;
}
.article-section {
margin-bottom: 30px;
padding: 20px;
background-color: #f9f9f9;
border-radius: 5px;
}
</style>
</head>
<body>
<h1>Generated Article</h1>
<div class="article-section">
{{ article | safe }}
</div>
</body>
</html>
Processing YouTube Transcripts
A crucial part of the application is extracting and processing the YouTube video transcript. This is achieved using the youtube_transcript_api library:
--- youtube_retriever.py ---
from youtube_transcript_api import YouTubeTranscriptApi
from urllib.parse import urlparse, parse_qs
import re
def get_youtube_id(url):
    # Patterns for different types of YouTube URLs
    patterns = [
        r'^https?:\/\/(?:www\.)?youtube\.com\/watch\?v=([^&]+)',
        r'^https?:\/\/(?:www\.)?youtube\.com\/embed\/([^?]+)',
        r'^https?:\/\/(?:www\.)?youtube\.com\/v\/([^?]+)',
        r'^https?:\/\/youtu\.be\/([^?]+)',
        r'^https?:\/\/(?:www\.)?youtube\.com\/shorts\/([^?]+)',
        r'^https?:\/\/(?:www\.)?youtube\.com\/live\/([^?]+)'
    ]

    # Try to match the URL against each pattern
    for pattern in patterns:
        match = re.match(pattern, url)
        if match:
            return match.group(1)

    # If no pattern matches, try parsing the URL
    parsed_url = urlparse(url)
    if parsed_url.netloc in ('youtube.com', 'www.youtube.com'):
        query = parse_qs(parsed_url.query)
        if 'v' in query:
            return query['v'][0]

    # If we still haven't found an ID, raise an exception
    raise ValueError("Could not extract YouTube video ID from URL")

def get_youtube_transcript(url):
    try:
        # Extract video ID from the URL
        video_id = get_youtube_id(url)

        # Get the transcript
        transcript = YouTubeTranscriptApi.get_transcript(video_id)

        # Combine all text parts
        full_transcript = " ".join([entry['text'] for entry in transcript])

        return full_transcript
    except ValueError as e:
        raise ValueError(f"Invalid YouTube URL: {str(e)}")
    except Exception as e:
        raise Exception(f"Error fetching YouTube transcript: {str(e)}")
This code handles various YouTube URL formats and extracts the video transcript, providing a solid foundation for the content generation process.
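If you want to sanity-check this module before wiring it into Flask, a quick standalone script (hypothetical, not part of the app) does the trick; swap in any public video that has captions enabled:
# test_transcript.py: quick manual check of youtube_retriever.py
from app.utils.youtube_retriever import get_youtube_transcript

if __name__ == '__main__':
    url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # any captioned public video
    transcript = get_youtube_transcript(url)
    print(f"Transcript length: {len(transcript)} characters")
    print(transcript[:300])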
Analyzing GitHub Repositories
To incorporate code examples and additional context, the application analyzes the provided GitHub repository:
--- github_analyzer.py ---
import os
import tempfile
from git import Repo
from urllib.parse import urlparse
import base64
from github import Github
def analyze_github_repo(repo_url):
    # Extract owner and repo name from URL
    parsed_url = urlparse(repo_url)
    path_parts = parsed_url.path.strip('/').split('/')
    owner, repo_name = path_parts[0], path_parts[1]

    # Initialize GitHub API client using the token from the environment
    g = Github(os.environ.get('GITHUB_TOKEN'))
    repo = g.get_repo(f"{owner}/{repo_name}")

    all_code = ""

    # Try to clone the repository first
    try:
        with tempfile.TemporaryDirectory() as temp_dir:
            Repo.clone_from(repo_url, temp_dir)
            for root, _, files in os.walk(temp_dir):
                for file in files:
                    if file.endswith(('.py', '.js', '.html', '.css', '.java', '.cpp', '.toml', '.xml', '.json', '.jsonl')):
                        file_path = os.path.join(root, file)
                        with open(file_path, 'r', encoding='utf-8') as f:
                            all_code += f"\n\n--- {file} ---\n{f.read()}"
    except Exception as e:
        print(f"Cloning failed: {str(e)}. Falling back to API method.")

        # If cloning fails, fall back to using the GitHub API
        def get_contents(path=''):
            nonlocal all_code
            contents = repo.get_contents(path)
            for content in contents:
                if content.type == 'dir':
                    get_contents(content.path)
                elif content.name.endswith(('.py', '.js', '.html', '.css', '.java', '.cpp', '.toml', '.xml', '.json', '.jsonl')):
                    file_content = base64.b64decode(content.content).decode('utf-8')
                    all_code += f"\n\n--- {content.path} ---\n{file_content}"

        get_contents()

    # Fetch README content
    try:
        readme = repo.get_readme()
        readme_content = base64.b64decode(readme.content).decode('utf-8')
    except:
        readme_content = "README not found"

    return all_code, readme_content
This function attempts to clone the repository locally or falls back to using the GitHub API if cloning fails. It extracts relevant code files and the README content, providing valuable context for the article generation process.
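You can give the analyzer the same standalone treatment; the snippet below is only an illustrative check and assumes GITHUB_TOKEN is exported in your shell and that you substitute a real repository URL:
# test_repo.py: quick manual check of github_analyzer.py
from app.utils.github_analyzer import analyze_github_repo

if __name__ == '__main__':
    code, readme = analyze_github_repo("https://github.com/your-user/your-repo")  # placeholder URL
    print(f"Collected {len(code)} characters of code")
    print(readme[:200])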
By combining these components with AI-powered content generation, the system offers a powerful solution for efficiently transforming YouTube videos into comprehensive blog posts. This approach not only saves time but also ensures that the generated content accurately reflects the original video material while incorporating relevant code examples and additional context from associated GitHub repositories.
Leveraging LangChain for Topic Extraction
LangChain, a powerful framework for developing language model-powered applications, is used to derive topics from the transcript. This helps in structuring the generated article:
--- file_parser.py ---
import os
from openai import OpenAI
from anthropic import Anthropic
from langchain.text_splitter import RecursiveCharacterTextSplitter

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def parse_transcript(file_path):
    with open(file_path, 'r') as file:
        return file.read()

def derive_topics_from_transcript(transcript):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,
        chunk_overlap=100,
        length_function=len
    )
    chunks = text_splitter.split_text(transcript)

    topics = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that generates concise and relevant topic titles."},
                {"role": "user", "content": f"Given the following chunk of text from a transcript, generate a concise and relevant topic title:\n\nChunk:\n{chunk}\n\nTopic Title:"}
            ],
            max_tokens=100
        )
        print(response.choices[0].message.content.strip())
        topics.append(response.choices[0].message.content.strip())

    return topics
This function splits the transcript into manageable chunks and uses OpenAI’s GPT model to generate relevant topic titles for each chunk.
Connecting the Frontend and Backend
Next, let's create the main route in routes.py:
from flask import Blueprint, render_template, request, jsonify
from app.utils.file_parser import derive_topics_from_transcript
from app.utils.github_analyzer import analyze_github_repo
from app.utils.article_generator import generate_article
from app.utils.youtube_retriever import get_youtube_transcript
import markdown2
main = Blueprint('main', __name__)
@main.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        speaker_name = request.form['speaker_name']
        speaker_bio = request.form['speaker_bio']
        video_title = request.form['video_title']
        video_description = request.form['video_description']
        github_url = request.form.get('github_url', '')
        youtube_url = request.form.get('youtube_url', '')

        transcript_text = ""
        if youtube_url:
            try:
                transcript_text = get_youtube_transcript(youtube_url)
            except Exception as e:
                return jsonify({'error': str(e)}), 400

        if not transcript_text:
            return jsonify({'error': 'No transcript provided or file not found'}), 400

        topics, topic_summaries = [], []
        topics = derive_topics_from_transcript(transcript_text)
        topic_summaries = [""] * len(topics)

        github_code = ""
        readme_content = ""
        repo_size_mb = 0
        if github_url:
            github_code, readme_content = analyze_github_repo(github_url)

        speaker_info = speaker_name + "\n " + speaker_bio

        article = generate_article(
            transcript_text, topics, topic_summaries,
            github_code, readme_content,
            speaker_info,
            video_title, video_description,
        )

        article_html = markdown2.markdown(article)
        return render_template('article.html', article=article_html)

    return render_template('index.html')
This route handles both GET and POST requests. When a POST request is received, it processes the form data, retrieves the YouTube transcript, derives topics (and placeholder topic summaries) from it, analyzes the GitHub repository (if provided), and then calls our generate_article() function, which is up next.
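Before spending API tokens, it's worth confirming the wiring: run python run.py and open http://127.0.0.1:5000/ in a browser, or use Flask's built-in test client for a quick smoke test of the GET path (a sketch, assuming the layout described earlier):
# smoke_test.py: renders the form without calling any external APIs
from app import create_app

app = create_app()

with app.test_client() as client:
    response = client.get('/')
    assert response.status_code == 200
    assert b"Article Generator" in response.data
    print("Form page renders OK")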
Implementing the Article Generation Process
With the groundwork laid for handling user input we can now focus on the core functionality of generating the article. This process involves several key steps:
Combining data from various sources
Using AI to generate the article content
Performing final checks and refinements
Combining Data Sources
The generate_article function in article_generator.py serves as the central point for combining all the gathered information:
import os
from openai import OpenAI
from anthropic import Anthropic
from langchain.text_splitter import RecursiveCharacterTextSplitter
def generate_article(transcript, topics, topic_summaries, combined_code, readme_content, speaker_info, video_title, video_description):
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    system_message = """ You are a highly skilled technical writer with experience in the PX4 ecosystem including MAVSDK, MAVLink, uORB, QGroundControl ROS, ROS 2, Gazebo, and the Pixhawk open hardware standards. Your task is to write a well-structured, engaging, and informative article or tutorial. """

    prompt = f"""
    Write an 800-word article based on the following information:
    Speaker Information: {speaker_info}
    Session Information:
    Title: {video_title}
    Description: {video_description}
    Transcript: {transcript}
    Topics: {', '.join(topics)}
    Topic Summaries: {', '.join(topic_summaries)}
    README Content: {readme_content}
    Relevant Code: {combined_code if combined_code else "No relevant code found."}
    Instructions:
    1. Include an introduction and a conclusion.
    2. Use the topics and topic summaries as a framework for the article's content.
    3. Include relevant code snippets from the provided code, explaining each snippet's purpose and functionality.
    4. Avoid code blocks longer than 14 lines. Break them into smaller, logical sections when necessary.
    5. Format the output in markdown.
    6. Aim for a well-structured, engaging, and informative article of approximately 800 words.
    IMPORTANT: Avoid using too many bulleted lists. Consolidate some lists into descriptive paragraphs if possible.
    Use ONLY the Relevant Code provided in the prompt. Do not reference or use any code from your training data or external sources.
    When code is relevant, introduce the concepts behind the code, then present the code, and finally describe how it works. """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ],
        max_tokens=4000
    )

    article = response.choices[0].message.content
    return article
This function takes all the collected data and constructs a detailed prompt for the AI model. The prompt includes specific instructions on how to structure the article, incorporate code snippets, and maintain a balance between technical depth and readability.
Notice in the system_message we are telling the LLM what it should focus on. You can swap out the subject matters listed here with those relevant to the target industry you are writing a tutorial for.
Take special note of the redundancies within the prompt. LLMs often need to be prompted multiple times with similar, overlapping information to fully hit the target.
The last thing to pay attention to is the request to output in markdown. This gives us formatting without the token cost of the much more verbose HTML.
Generating the Article Content
The article generation process leverages OpenAI's GPT-4 model to create the initial draft. The system message sets the context for the AI, positioning it as a technical writer with expertise in the PX4 ecosystem. This approach helps ensure that the generated content is both technically accurate and well-structured. Back in routes.py, the returned markdown is converted to HTML with markdown2 and rendered through article.html.
Final Checks and Refinements
After the initial article generation, an additional check is performed to enhance the quality and accuracy of the content. The check_code function in article_checker.py verifies that the code snippets in the article match the provided GitHub repository code:
import os
from openai import OpenAI

def check_code(article, combined_code):
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    prompt = f"""
    Audit this Article for the correct and accurate use of the code within it. Use the Combined Code as your primary code reference. If code in the Article looks similar or calls similar functions but is slightly different than the Combined Code, replace it in the final output with the relevant Combined Code snippet supplied in part or in whole, whichever fits best for the task.
    If the article's content needs to be slightly modified to fit the code you are replacing it with from Combined Code, do so and only add content that is necessary.
    Article: {article}
    Combined Code: {combined_code if combined_code else "No relevant code found."}
    Return the edited article as your final output. """

    system_message = """ You are a highly skilled technical writer with experience in the PX4 ecosystem including MAVSDK, MAVLink, uORB, QGroundControl ROS, ROS 2, Gazebo, and the Pixhawk open hardware standards. You are editing the content in the article for accuracy and correctness. IMPORTANT: Your focus is auditing the code and replacing it if necessary. Audit the code based on the Combined Code supplied. If no code is supplied, do nothing. Return the whole edited article. Do not add any LLM assistant language such as "no other edits were made", "Here's the edited article..", "Here's the edited article with the code snippets updated based on the Combined Code provided:" or "The rest of the article remains unchanged". Only return the edited or non-edited article copy and nothing else. """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ],
        max_tokens=4000
    )

    return response.choices[0].message.content
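A natural place to call check_code is in routes.py, right after generate_article() returns and before the markdown is converted to HTML. A sketch of that wiring, assuming article_checker.py lives in app/utils/ alongside the other helpers:
# routes.py (sketch): import the checker with the other utilities
from app.utils.article_checker import check_code

# ...then, inside index(), after the article has been generated:
if github_code:
    # Re-audit the draft so its code snippets match what's actually in the repository
    article = check_code(article, github_code)
# article_html = markdown2.markdown(article) then runs on the audited draft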
Conclusion
By leveraging various AI technologies and APIs, this system automates the process of converting YouTube videos into well-structured, informative blog posts.
The system’s key strengths lie in its ability to:
Extract and process information from multiple sources (YouTube transcripts, GitHub repositories, user input).
Use AI to derive topics and generate coherent, context-aware content.
Implement rigorous checks to ensure code accuracy and proper use of industry terms.
Present the generated content in a clean, readable format.
While the current implementation focuses on technical content related to the PX4 ecosystem, the underlying principles and architecture can be adapted to various domains and content types. Future enhancements could include support for longer articles, integration with slide decks, and the incorporation of images to further enrich the generated content.
As AI technologies continue to evolve, systems like this will play an increasingly important role in content creation and management, enabling content creators to maximize the value of their work across multiple platforms efficiently and effectively.
If you had fun with this tutorial, be sure to join the OpenAI Application Explorers Meetup Group to learn more about awesome apps you can build with AI.