Create your own Video Knowledgebase with AWS


  1. Video Transcription Step
  2. Overall Architecture & Control Flow
  3. Creating Knowledge Base from QuickStart
  4. Deploying the Code (SAM Template)
  5. The KnowledgeBaseSync Lambda Function
  6. Viewing Embeddings in Aurora Serverless
  7. Interacting with the Knowledge Base
  8. Conclusion

Video Transcription Step

Please refer to the blog post below to convert video to text using Amazon Transcribe:

Building a Scalable Video Transcription Architecture on AWS


Learn how to automate video transcription using AWS services like S3, EventBridge, SQS, Lambda, and Amazon Transcribe. This step-by-step guide helps you create an end-to-end scalable solution for seamless video-to-text conversion, leveraging AWS best practices for efficient architecture and automation.

Overall Architecture & Control Flow

When a new transcript file lands in the S3 bucket, an EventBridge rule forwards the event to an SQS queue, which triggers the KnowledgeBaseSync Lambda function. The function reads the Knowledge Base ID from the SSM Parameter Store and starts a Bedrock ingestion job for each data source, keeping the Knowledge Base in sync with the latest transcripts.

Creating Knowledge Base from QuickStart

Please select the “us-east-1” region and follow the on-screen instructions, or refer to the demo video posted above.

Deploying the Code (SAM Template)

Here are some critical snippets from the SAM template:

  • Knowledge Base ID Parameter: We store the ID of our newly created Knowledge Base in the AWS Systems Manager (SSM) Parameter Store for easy access by other components.

YAML

  KnowledgeBaseIdParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /poc/bedrock/knowledgebaseid
      Type: String
      Value: "kbId" # Replace with your actual Knowledge Base ID

  • SQS Queue, EventBridge Rule, and Lambda Function: These components form the core of our event-driven pipeline.

YAML

  # ... (SQS Queue, Queue Policy, EventBridge Rule definitions) ...

  KnowledgeBaseSync:
    Type: AWS::Serverless::Function
    Properties:
      # ... (Role, FunctionName, CodeUri, Handler definitions) ...
      Environment:
        Variables:
          KNOWLEDGEBASESYNC_QUEUE: !GetAtt KnowledgeBaseSyncQueue.Arn
          KNOWLEDGEBASEID_PARAMETER: !Ref KnowledgeBaseIdParameter
      Events:
        KnowledgeBaseSyncSQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt KnowledgeBaseSyncQueue.Arn
            BatchSize: 2 
            Enabled: true
            ScalingConfig:
              MaximumConcurrency: 2

  # ... (IAM Role, Log Group definitions) ...

You can get the full code base on GitHub – https://github.com/pbkn/video-bedrock-knowledgebase

The KnowledgeBaseSync Lambda Function

This Java-based Lambda function is the heart of our automation. It performs the following tasks:

  • Retrieves Event Data: Extracts the S3 bucket name and object key from the SQS message.
  • Fetches Knowledge Base ID: Reads the Knowledge Base ID from the SSM Parameter Store.
  • Lists Data Sources: Retrieves a list of all data sources associated with the Knowledge Base.
  • Starts Ingestion Job: Initiates a data ingestion job for the relevant data source using the StartIngestionJob API of the BedrockAgentClient.

Here’s a snippet of the KnowledgeBaseSync function’s handleRequest method:

Java

public String handleRequest(SQSEvent input, Context context) {
    // ... (Initialization, SQS message parsing; String knowledgeBaseIds is declared here) ...

    // Extracting the knowledge base ID from the Environment Variable
    String knowledgebaseidParameter = System.getenv("KNOWLEDGEBASEID_PARAMETER");
    
    // ... (Error handling for missing parameter) ...

    // Extracting the knowledge base ID values from SSM Parameter store
    try (SsmClient ssmClient = SsmClient.builder()
            .region(Region.US_EAST_1)
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build()) {
        knowledgeBaseIds = ssmClient.getParameter(builder -> builder
                .name(knowledgebaseidParameter)
                .withDecryption(true)
        ).parameter().value();
    } catch (Exception e) {
        // ... (Error handling for SSM retrieval) ...
    }

    String[] knowledgeBaseIdArray = knowledgeBaseIds.split(",");
    for (String kbId : knowledgeBaseIdArray) {
        kbId = kbId.trim();
        log.info("Knowledge Base ID: {}", kbId);

        // Get List of all Datasources from bedrock knowledgebase Id
        try (BedrockAgentClient bedrockAgentClient = BedrockAgentClient.builder()
                .credentialsProvider(DefaultCredentialsProvider.create())
                .region(Region.US_EAST_1)
                .build()) {
            ListDataSourcesResponse dataSourcesResponse = bedrockAgentClient.listDataSources(ListDataSourcesRequest.builder()
                    .knowledgeBaseId(kbId).build());
            for (DataSourceSummary ds : dataSourcesResponse.dataSourceSummaries()) {
                //Start data sync for updated bucket
                log.info("Data Source details: {}", ds);
                bedrockAgentClient.startIngestionJob(StartIngestionJobRequest.builder()
                        .knowledgeBaseId(kbId).dataSourceId(ds.dataSourceId())
                        .build());
            }
        } catch (Exception e) {
            // ... (Error handling for Bedrock API calls) ...
        }

    }

    return "KnowledgeBaseSync executed successfully";
}
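
The SQS message parsing elided above amounts to reading each record’s body and extracting the bucket name and object key. Here is a minimal sketch, assuming the EventBridge rule forwards the S3 “Object Created” event unchanged, so each record body is an EventBridge envelope whose detail holds bucket.name and object.key; the helper class and its use of Jackson are illustrative, not taken from the repository:

Java

// Minimal sketch of the elided SQS message parsing. Assumes each record body
// is the EventBridge envelope for an S3 "Object Created" event, whose "detail"
// holds bucket.name and object.key. Illustrative only, not from the repository.
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class S3EventParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void logS3Objects(SQSEvent input) throws Exception {
        for (SQSEvent.SQSMessage message : input.getRecords()) {
            // Each SQS record body carries the EventBridge event as JSON
            JsonNode detail = MAPPER.readTree(message.getBody()).path("detail");
            String bucketName = detail.path("bucket").path("name").asText();
            String objectKey = detail.path("object").path("key").asText();
            System.out.printf("New transcript: s3://%s/%s%n", bucketName, objectKey);
        }
    }
}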


Viewing Embeddings in Aurora Serverless

Make use of the Query Editor in the RDS console to view the embedding data in the Aurora Serverless database. Please refer to the video above for step-by-step instructions:

SQL

select * from bedrock_integration.bedrock_knowledge_base;
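
If you prefer to query programmatically rather than through the console, the RDS Data API is one option for Aurora Serverless. Here is a minimal sketch, assuming the Data API is enabled on the cluster; the cluster ARN, secret ARN, and database name below are placeholders to replace with your own values:

Java

// Minimal sketch: running the same query through the RDS Data API instead of
// the console Query Editor. Assumes the Data API is enabled on the cluster;
// the cluster ARN, secret ARN, and database name are placeholders.
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.rdsdata.RdsDataClient;
import software.amazon.awssdk.services.rdsdata.model.ExecuteStatementResponse;

public class EmbeddingViewer {
    public static void main(String[] args) {
        try (RdsDataClient client = RdsDataClient.builder()
                .region(Region.US_EAST_1)
                .build()) {
            ExecuteStatementResponse response = client.executeStatement(builder -> builder
                    .resourceArn("arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster") // placeholder
                    .secretArn("arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret") // placeholder
                    .database("postgres") // placeholder database name
                    .sql("select * from bedrock_integration.bedrock_knowledge_base limit 10"));
            // Each record is one row of the embeddings table
            response.records().forEach(row -> System.out.println(row));
        }
    }
}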

Interacting with the Knowledge Base

You now have your very own video-based knowledge base. You can query it in natural language from the Bedrock console, or programmatically, as sketched below.
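
Beyond the console, a quick way to ask questions against the ingested transcripts is the RetrieveAndGenerate API of the Bedrock agent runtime. Here is a minimal sketch; the question text, Knowledge Base ID, and model ARN are placeholders to replace with your own values:

Java

// Minimal sketch: querying the knowledge base in natural language via the
// RetrieveAndGenerate API. The knowledge base ID and model ARN are placeholders.
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.bedrockagentruntime.BedrockAgentRuntimeClient;
import software.amazon.awssdk.services.bedrockagentruntime.model.RetrieveAndGenerateResponse;
import software.amazon.awssdk.services.bedrockagentruntime.model.RetrieveAndGenerateType;

public class KnowledgeBaseQuery {
    public static void main(String[] args) {
        try (BedrockAgentRuntimeClient client = BedrockAgentRuntimeClient.builder()
                .region(Region.US_EAST_1)
                .build()) {
            RetrieveAndGenerateResponse response = client.retrieveAndGenerate(req -> req
                    .input(input -> input.text("What topics are covered in the videos?"))
                    .retrieveAndGenerateConfiguration(cfg -> cfg
                            .type(RetrieveAndGenerateType.KNOWLEDGE_BASE)
                            .knowledgeBaseConfiguration(kb -> kb
                                    .knowledgeBaseId("kbId") // placeholder, as in the SAM template
                                    .modelArn("arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0")))); // placeholder
            // Print the generated answer grounded in the retrieved transcript chunks
            System.out.println(response.output().text());
        }
    }
}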

Conclusion

This POC demonstrates a powerful way to automate the ingestion of data into Amazon Bedrock Knowledge Bases. By leveraging event-driven architecture and serverless components, we can build scalable and efficient pipelines that keep our GenAI applications informed with the latest information. This approach frees developers from manual data management tasks, allowing them to focus on building innovative and impactful GenAI solutions. The same architecture can be used to ingest data into other vector stores like OpenSearch, RDS for Postgres with pgVector etc.

Next Steps:

  • Explore how to integrate this pipeline with your existing transcription workflow.
  • Customize the KnowledgeBaseSync function to handle different data formats or perform additional data processing.
  • Experiment with different Bedrock models and Knowledge Base configurations to optimize your GenAI application’s performance.

By implementing this automated ingestion pipeline, you can unlock the full potential of Amazon Bedrock Knowledge Bases and build truly intelligent GenAI applications.