
APE-ing Around the Advertisement Prediction Engine

  • Writer: Dylan Husserl
  • Oct 22
  • 7 min read

One of our interns, Dylan Trenck - see the bottom of this post for his contact info - spent the past few months prototyping a proof-of-concept video advertisement prediction engine. Sporelogic views video interpretation and prediction as a natural extension of our current capabilities, so getting to see Dylan investigate this space was super interesting for us. We have reformatted an internal paper he produced into a blog post and want to share it with everyone. Enjoy - you can try the demo out for yourself at:



Advertisement Prediction Engine (APE): A Proof‑of‑Concept for Predicting Ad Performance



Short‑form attention is scarce and expensive. Creators and marketers need a fast way to estimate whether an advertisement is likely to perform before spending on distribution. The Advertisement Prediction Engine (APE) is a lightweight, multimodal proof‑of‑concept that predicts whether an advertisement will be a high or low performer based on its audio track and transcript. Trained on 313 top‑performing branded advertisements, the system currently achieves approximately 65% accuracy and returns an interpretable summary of the features that most influenced each decision. This paper presents the motivation for the project, the major design choices, the data and features used, the training and validation procedure, and a roadmap for improving accuracy and scope.


In short‑form media, success is driven by a few seconds of strong attention capture, clear intent, and an emotional rhythm that motivates action. My previous research into this problem space suggested that structural elements such as early hooks, pacing, sentiment flow, and the clarity of calls to action correlate with virality and lift. I initially built an end‑to‑end model that consumed synchronized video, audio, and transcript features and produced a quantitative estimate of performance. In practice, however, the complexity of that approach obscured the signal and made the first version difficult to train reliably at the scale available. I therefore reframed the problem as a simpler binary classification task and focused on content signals that can be extracted quickly and consistently from an advertisement's audio and transcript, without yet engaging with the actual video frames. Throughout this paper, "performant" refers to a creative that is more likely to fall among the highest performers within the training distribution, conditioned on the view‑based popularity of branded ads.


The dataset comprises 313 high‑performing brand advertisements collected with the YouTube Data API. To mitigate brand‑size bias, I sampled across channels of varied sizes and product domains. Each advertisement was processed by extracting the audio stream from the original MP4 using FFmpeg and converting it to a WAV file. Transcripts were originally generated from the audio using OpenAI Whisper, but after difficulties recognizing speech, this library was replaced with Google Speech Recognition with tuned detection thresholds to improve coverage on speech segments that are faint or partially masked by music or other layered audio. All downstream features were standardized with a standard scaler fit only on the training split to avoid any leakage. For training, advertisements were split into high and low performers via a threshold on view velocity, calculated as the video's views per day normalized by the channel's subscriber count.
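As a concrete illustration of the labeling step, here is a minimal sketch of how the view‑velocity threshold could be computed in Python. The column names ('views', 'days_since_upload', 'subscriber_count') and the median cutoff are assumptions for this example, not the exact values used in APE.

```python
import pandas as pd

def label_by_view_velocity(df: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    """Assign a binary high/low performance label from view velocity.

    Expects columns 'views', 'days_since_upload', and 'subscriber_count'
    (hypothetical names for this sketch).
    """
    df = df.copy()
    # Views per day, normalized by channel size to reduce brand-size bias.
    df["view_velocity"] = (df["views"] / df["days_since_upload"]) / df["subscriber_count"]
    # Ads at or above the chosen quantile are treated as high performers.
    cutoff = df["view_velocity"].quantile(quantile)
    df["label"] = (df["view_velocity"] >= cutoff).astype(int)
    return df
```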


The model uses a small set of handcrafted features designed to capture the structural elements of an advertisement that are most predictive of early engagement and intent. From the audio I measured overall RMS loudness, dynamic range, energy variability across time, coarse pitch statistics, speech rate, and the distribution of pause durations. Together these features describe the pacing and energy of an advertisement without requiring heavy or complex acoustic modeling. From the transcripts I created indicators for the presence of a hook within the first 15 words, the count and position of calls to action, a transcript‑level sentiment score, and counts of emphatic or interrogative constructions such as exclamations and questions. While intentionally simple, these features reflect patterns repeatedly cited in creative advertising best practices and throughout my research, and they are robust to small transcription errors.
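To make these features concrete, the following sketch shows how a handful of them could be computed with librosa and simple string matching. The hook keywords, call‑to‑action phrases, and the TextBlob sentiment scorer are illustrative stand‑ins for this example rather than APE's exact implementation, and only a subset of the twenty features is shown.

```python
import numpy as np
import librosa
from textblob import TextBlob  # stand-in sentiment scorer for this sketch

def audio_features(wav_path: str) -> dict:
    """A subset of the audio features: loudness, energy variability, pitch, pauses."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y)[0]
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # coarse pitch track
    intervals = librosa.effects.split(y, top_db=30)     # non-silent regions
    gaps = [(b[0] - a[1]) / sr for a, b in zip(intervals[:-1], intervals[1:])]
    return {
        "rms_mean": float(rms.mean()),                  # overall loudness
        "dynamic_range": float(rms.max() - rms.min()),  # loud vs. quiet extremes
        "energy_var": float(rms.std()),                 # energy variability over time
        "pitch_mean": float(np.mean(f0)),
        "pitch_std": float(np.std(f0)),
        "mean_pause_s": float(np.mean(gaps)) if gaps else 0.0,
    }

HOOK_WORDS = {"imagine", "what", "you", "new", "free", "stop"}  # illustrative cues
CTA_PHRASES = ["buy now", "sign up", "learn more", "shop", "download", "subscribe"]

def text_features(transcript: str) -> dict:
    """A subset of the transcript features: hook, calls to action, sentiment, emphasis."""
    lowered = transcript.lower()
    first_15 = set(lowered.split()[:15])
    return {
        "has_early_hook": int(bool(first_15 & HOOK_WORDS)),
        "cta_count": sum(lowered.count(p) for p in CTA_PHRASES),
        "sentiment": TextBlob(transcript).sentiment.polarity,
        "exclamations": transcript.count("!"),
        "questions": transcript.count("?"),
    }
```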



APE uses a single Random Forest classifier trained on a twenty‑dimensional vector formed by concatenating the ten audio features with the ten transcript features. The choice of a Random Forest balances three goals for this proof‑of‑concept: resilience to nonlinear interactions among features, tolerance of small datasets without extensive parameter searches, and straightforward interpretability via feature importances. I evaluated the model with stratified cross‑validation and a held‑out evaluation set. Across splits, the classifier achieved roughly 65% accuracy. Alongside the predicted class, the system reports the features that contributed most to the decision for a given ad, enabling rapid creative iteration guided by specific, actionable signals.
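The sketch below illustrates that training and evaluation loop with scikit‑learn. The specific hyperparameters (200 trees, five stratified folds) are assumptions for this example; the important parts are the scaler fit inside the cross‑validation pipeline and the feature importances read off the fitted forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_and_evaluate(X: np.ndarray, y: np.ndarray, feature_names: list):
    """X is the (n_ads, 20) matrix of concatenated audio + transcript features; y holds 0/1 labels."""
    # Keeping the scaler inside the pipeline means it is re-fit on each training fold only.
    model = make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=200, random_state=42),
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

    # Refit on the full training data and inspect which features the forest leans on.
    model.fit(X, y)
    importances = model.named_steps["randomforestclassifier"].feature_importances_
    for name, weight in sorted(zip(feature_names, importances), key=lambda t: -t[1])[:5]:
        print(f"{name}: {weight:.3f}")
    return model
```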


The current pipeline accepts an .mp4 file as input, extracts and resamples the audio, produces a transcript, computes the audio and text features, standardizes them, and performs inference through the Random Forest. The model output is a binary label of either high or low predicted performance. This is displayed to the user on the frontend with a confidence estimate and a short natural‑language explanation that points to the hook quality, the density and clarity of calls to action, the apparent speech rate, the presence or absence of sustained pauses, and any other noticeable indicators. The entire process is designed to be fast enough for batch evaluation of many variants during creative development, speeding up iteration.
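A minimal sketch of how that user‑facing output could be assembled is shown below, assuming a fitted classifier (or pipeline) exposing predict_proba. It uses the forest's global feature importances as a rough proxy for per‑ad contributions; a per‑ad attribution method such as SHAP would be a more faithful choice.

```python
def explain_prediction(model, features, feature_names, top_k=3) -> dict:
    """Return the predicted label, a confidence score, and a short explanation.

    `model` is a fitted classifier (or pipeline) exposing predict_proba;
    `features` is a single feature vector in the same order as feature_names.
    """
    proba = model.predict_proba([features])[0]
    label = "high" if proba[1] >= 0.5 else "low"

    # Global importances as a rough proxy for what drove this decision.
    forest = getattr(model, "named_steps", {}).get("randomforestclassifier", model)
    top = sorted(zip(feature_names, forest.feature_importances_), key=lambda t: -t[1])[:top_k]
    reasons = ", ".join(name for name, _ in top)

    return {
        "predicted_performance": label,
        "confidence": float(max(proba)),
        "explanation": f"Prediction driven mainly by: {reasons}.",
    }
```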


The first challenge in this project was the choice of prediction target. Using logistic regression models to predict individual counts of likes, comments, and shares made it difficult to build and train a reliable model given the small dataset available. Recasting the design as a binary classification problem made it far easier for the model to generalize over a single prediction of high vs. low performance as opposed to multiple regression predictions. The second challenge came in the form of transcript detection. Early experiments with OpenAI Whisper failed to capture speech in a large fraction of clips, particularly when voiceover competed with music or sound effects, which drastically reduced overall model accuracy. Switching to Google Speech Recognition and lowering the detection threshold substantially improved coverage without requiring manual cleanup. Overhauling the speech recognition, narrowing the scope to audio and text, and recasting the prediction as a binary classification problem immediately improved learnability and reduced brittleness. These changes preserved interpretability while giving the model room to generalize beyond the small dataset.
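For reference, here is roughly what the lowered‑threshold transcription looks like, assuming the SpeechRecognition Python package was the interface to Google Speech Recognition; the energy_threshold value shown is illustrative, not the tuned value from the project.

```python
import speech_recognition as sr

def transcribe(wav_path: str, energy_threshold: int = 100) -> str:
    """Transcribe a WAV file with Google Speech Recognition.

    The library's default energy threshold is around 300; lowering it makes
    the recognizer more sensitive to faint voiceover competing with music.
    """
    recognizer = sr.Recognizer()
    recognizer.energy_threshold = energy_threshold
    recognizer.dynamic_energy_threshold = False  # keep the tuned threshold fixed
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # no intelligible speech detected
```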



On held‑out evaluation, APE classifies advertisements as high or low performers with an accuracy of approximately 65%. More important than the headline number is the model’s ability to surface reasons for its decisions. Advertisements flagged as likely high performers typically display an explicit hook in the opening line, a measurable cadence in delivery with few long pauses, and at least one clear call to action stated in direct language. Ads labeled as likely low performers often lack an early value proposition, bury the ask, or exhibit speech patterns that feel meandering or under‑energized. Because the features are human‑readable, the explanations align with creative intuition and can be acted upon during iteration.


As a proof‑of‑concept, APE demonstrates that content structure alone (absent any frame‑level video content) contains enough information to provide a useful first‑pass screen of an advertisement's quality. The compact feature set is intentionally conservative, favoring robustness and interpretability over raw expressiveness. The current accuracy should not be interpreted as a ceiling. Rather, it establishes a practical baseline from which to expand modalities and improve generalization. In retrospect, the most consequential choice was to prioritize a minimal, end‑to‑end system that worked reliably over an ambitious tri‑modal architecture that strained the data and obscured the signal of a quality advertisement. That decision accelerated iteration, clarified the error modes, and produced output that is readily explainable to creators.


The present product is limited by dataset size, platform scope, and the absence of visual features. Expanding the corpus to include a broader range of brands, verticals, and languages will reduce variance and better reflect the creative norms of different markets, both domestic and international. Incorporating ads from sources beyond YouTube, such as the Meta and TikTok Ad Libraries, should improve coverage of shorter‑form conventions and platform‑specific pacing. On the modeling side, a larger, more diverse dataset will improve the accuracy of the audio/transcript‑based model. The logical next step is to reintroduce video features and fuse the predictions of all three modalities to achieve an accurate generalization based on the entire content of an advertisement. The inclusion of frame‑level analysis would be a major advancement of APE but comes with a computational cost that will have to be considered during development. A complete model would fully capture the key elements that make an advertisement perform well and accurately predict its performance pre‑launch, enabling creators to test their work before it hits the main stage.



The current implementation follows a straightforward sequence: FFmpeg extracts and resamples the audio track; librosa computes the acoustic features; Google Speech Recognition produces the transcript from which structural and sentiment features are derived; both feature sets are standardized and then concatenated; a Random Forest classifier performs inference and returns the predicted class, a confidence score, and a short explanation assembled from the highest‑weight features. The code is structured to allow batch processing so that creative teams can score many iterations and quickly identify which edits move the prediction in the right direction.
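Putting the pieces together, a batch‑scoring loop along these lines would match the sequence described above. The FFmpeg flags are standard (mono, 16 kHz, no video), while the helper functions (transcribe, audio_features, text_features, explain_prediction) refer to the earlier sketches and are assumptions of this example rather than APE's actual module names.

```python
import subprocess
from pathlib import Path

def extract_audio(mp4_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract and resample the audio track from an MP4 with FFmpeg (mono, 16 kHz WAV)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path, "-vn", "-ac", "1", "-ar", str(sample_rate), wav_path],
        check=True,
    )

def score_batch(mp4_dir: str, model, feature_names) -> dict:
    """Score every .mp4 in a directory using the helpers sketched earlier.

    Assumes feature_names matches the keys produced by audio_features and text_features.
    """
    results = {}
    for mp4 in sorted(Path(mp4_dir).glob("*.mp4")):
        wav = mp4.with_suffix(".wav")
        extract_audio(str(mp4), str(wav))
        transcript = transcribe(str(wav))
        feats = {**audio_features(str(wav)), **text_features(transcript)}
        vector = [feats[name] for name in feature_names]
        results[mp4.name] = explain_prediction(model, vector, feature_names)
    return results
```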


APE shows that well‑chosen audio and text features are sufficient to generate a useful early read on ad performance and to produce guidance that maps cleanly to editing decisions. The next phase of work will expand the dataset, reintroduce a light layer of visual descriptors, and explore simple fusion strategies that preserve the system’s clarity while raising accuracy. The long‑term goal is a nimble, interpretable predictor that helps creators and marketers test ideas cheaply, learn faster from each iteration, and invest distribution spend where it is most likely to drive outcomes.



At the time of this post, Dylan is a senior at the University of North Florida planning on graduating in August 2026 - check out his LinkedIn here!



 
 
 

