built custom py scrapper #140

Open

yb175 wants to merge 1 commit into main from parsing-engine

Conversation

yb175 (Owner) commented May 2, 2026

{
    "total": 10,
    "jobs": [
        {
            "title": "AI/ML Engineering Manager, Payment Intelligence",
            "company": "Stripe",
            "location": "US-SF, US-NYC, US-SEA",
            "remote": false,
            "description": "<h2>Who we are</h2> <h3>About Stripe</h3> <p>Stripe’s mission is to accelerate global economic and technological development. We offer financial infrastructure and a variety of services to serve the needs of a wide range of users, from startups to enterprises, with global scale and industry-leading reliability and product quality.&nbsp; All financial services businesses face a trade-off between access, which we want to expand, and risk, which we want to minimize. We use machine learning to scalably and intelligently optimize across both.</p> <p>Artificial Intelligence and Machine learning is an integral part of almost every service at Stripe. It is a key investment area with products and use cases that span merchant and transaction risk, payments optimization, identity, and merchant data analytics and insights. We are also using the latest generative AI technologies (such as LLMs and FMs) to re-imagine product experiences and developing AI Assistants and Agents both for our customers (e.g. Radar Assistant and Sigma Assistant), and also to make Stripes more productive across Support, Marketing, Sales, and Engineering roles within the company.</p> <h3>About the team</h3> <p>We are dedicated to accelerating business value delivery via AI/ML through the Payment Intelligence Suite of products (Radar, AuthBoost, Payments Analytics, Authentication, Disputes) and any other Stripe product that works with these solutions. Our mission is to make Stripe the leader in payments performance via AI/ML, and to allow any other Stripe team looking to balance performance across revenue, cost, and risk the same capabilities so that our shared users have a consistently great experience managing performance on Stripe. We work closely with our partners in the Information org, Risk, RFA, and Money-as-a-Service to build solutions that address our users needs and drive revenue for Stripe.</p> <p>From a data perspective, Stripe handles over <a href=\"https://stripe.com/newsroom/news/stripe-2024-update\">$1.4T in payments volume</a> per year, which is roughly <a href=\"https://en.wikipedia.org/wiki/Gross_world_product\">1.3% of the world’s GDP</a>. We process petabytes of financial data to drive analytics and AI/ML solutions. We use a combination of highly scalable and explainable models such as linear/logistic regression and random forests, along with the latest deep neural networks from transformers to LLMs. Our latest innovations have been around figuring out how best to bring transformers and LLMs to improve existing models and also enable entirely new product ideas that are only made possible by GenAI.</p> <h2>What you’ll do</h2> <p>In this role, you will be a transformative leader with the responsibility of overseeing three critical teams within PayIntel. 
You will drive the strategic direction and execution for how Payments, and Stripe more generally, adopts AI, and how our suite of performance products leverage ML/AI to provide a consistent experience that generates revenue for Stripe.&nbsp;</p> <h3>Responsibilities</h3> <ul> <li>Lead the development of our decisioning platform, ensuring that ML/AI decisions are consistent across the lifecycle of a payment regardless of which product makes those decisions.&nbsp;</li> <li>Collaborate with the ML Foundations team to develop and deploy Stripe’s Foundation Model to risk, conversion and growth opportunities in Payments.</li> <li>Extend our performance analytics, observability, and risk management capabilities across Stripe so that users have a consistently high quality performance experience for cost, revenue, and risk across Stripe.</li> <li>Expand our Payments Analytics solution to ensure that users are fully aware of performance opportunities and can take full advantage of our suite of products to automate improvements.</li> </ul> <h2><strong>Who you are</strong></h2> <p><span style=\"font-weight: 400;\">We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.</span></p> <h3>Minimum requirements</h3> <ul> <li>10+ years of experience building and shipping ML models that power AI/ML product features, with a strong emphasis on modern technologies such as DNNs, Transformers, and Foundation Models.</li> <li>5+ years of experience managing and developing a team of managers, fostering their growth and ensuring alignment with strategic objectives</li> <li>A strong builder mentality, with the ability to define a team's charter and lead the development of complex systems from scratch.</li> <li>Proven ability to shepherd large, complex projects and drive transformational change in an organization and with partners that depend on your team’s platform services.</li> <li>Deep passion for solving really interesting problems, willingness to experiment, engage with customers directly to understand how well our solutions are working, and to build deep knowledge about performance that drives impact across the company.</li> </ul> <h3><strong>Preferred qualifications</strong></h3> <ul> <li>Experience with a large-scale, data-rich product in a domain such as payments, commerce, search, or social media.</li> <li>Knowledge of the challenges and opportunities in applying ML to fraud prevention, consumer intelligence, or financial services.</li> <li>Experience building platforms that accelerate service adoption outside your organization with little maintainability overhead.</li> </ul>",
            "apply_url": "https://stripe.com/jobs/search?gh_jid=7286376",
            "source": "greenhouse"
        },
        {
            "title": "Analytics Engineer",
            "company": "Stripe",
            "location": "Seattle, WA",
            "remote": false,
            "description": "<h3><strong>About Stripe</strong></h3> <p>Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.</p> <h3>About the team</h3> <p>The Global Payments Performance Team are Stripe’s foremost payments experts, identifying payment optimization opportunities for Stripe’s users. We serve a critical role both in demonstrating thought leadership on challenging payment topics and partnering with our high-potential merchants to set them up for long term success. We make payments performance a competitive advantage for our users, large and small, by providing them with best-in-class technology in concert with insightful and specific advice on how to harness this technology to achieve their business goals.</p> <h3>What you’ll do</h3> <p>You’ll be embedded directly within the Global Payments Performance team, and responsible for data &amp; tooling used by the team and our stakeholders. You will be instrumental in scaling the subject matter expertise of the team’s payments domain experts - building datasets, agents, and tools that enable Stripes to deliver tailored, expert level content to their customers.</p> <h3>Responsibilities</h3> <ul> <li>Develop analytics products, dashboards, and tools that GTM leverages to deliver expert level payment optimization sessions to Stripe customers.</li> <li>Maintain analytics assets owned by the team, such as metric definitions, technical documentation, vertical agents and query repositories.</li> <li>Build data pipelines that power our user insights.</li> <li>Act as a resource to the team communicating the nuance and technical complexity of payments performance adjacent data.</li> </ul> <h3>Who you are</h3> <p>We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.</p> <h3>Minimum requirements</h3> <ul> <li>7+ years experience in Analytics, Business Intelligence Engineering, Data Science or Technical Customer Advisory roles.</li> <li>4+ years experience with SQL and Python.</li> <li>Working understanding of code development processes and best practices s.a. git, code reviews, and testing.</li> <li>Expertise in data visualization and using data insights to make recommendations.</li> <li>Proven ability to manage and deliver on multiple projects with great attention to detail.</li> <li>Ability to clearly communicate results and drive impact.</li> </ul> <h3>Preferred qualifications</h3> <ul> <li>Degree in Mathematics, Statistics, Economics, Engineering, or a related technical field.</li> <li>Familiarity with the payments ecosystem and how merchants optimize payment processing (e.g., conversion, authorization rates, authentication success rates, transaction fraud, and network costs</li> </ul>",
            "apply_url": "https://stripe.com/jobs/search?gh_jid=7863844",
            "source": "greenhouse"
        },
        {
            "title": "Android Engineer, Terminal",
            "company": "Stripe",
            "location": "Toronto",
            "remote": false,
            "description": "<h2><strong>Who we are</strong></h2> <h3><strong>About Stripe</strong></h3> <p>Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.</p> <h3><strong>About the team</strong></h3> <p>Stripe Terminal helps our users extend their online presence to the physical world. The Terminal team’s mission is to make it as easy for businesses to accept in-person payments as the Stripe API has done for online payments. Stripe was founded to make it easier for developers to accept payments. We’ve solved a small part of that problem, but our ambition is to go much further.&nbsp;</p> <h2><strong>What you’ll do</strong></h2> <ul> <li>Android engineers on this team will build and enhance applications, services, and the OS that run on the physical Terminal devices.&nbsp;</li> <li>Building the frameworks for other engineers, both internal and external to Stripe, to develop on our custom platform with ease.</li> </ul> <h3><strong>Responsibilities</strong></h3> <ul> <li>Design, build and maintain Android apps and SDKs in Kotlin</li> <li>Develop Android payment applications for a variety of devices and form factors</li> <li>Work with engineers, product managers, designers, and stakeholders across the company to bring new features and products to Stripe’s mobile users</li> <li>Collaborate with Android developers who work on the Stripe mobile apps and Stripe Terminal to set best practices for Android development across the company</li> <li>Work with user research and product design to understand users and address their needs</li> </ul> <p>&nbsp;<strong>Who you are</strong></p> <p>We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.</p> <h3><strong>Minimum requirements</strong></h3> <ul> <li>2+ years of experience in Android Development</li> <li>Experience working with at least 1 of the following: Kotlin, Java, Swift, Objective-C, Go, Python</li> </ul> <h3><strong>Preferred qualifications</strong></h3> <ul> <li>Payments expertise or knowledge</li> <li>Backend infrastructure or services experience</li> <li>Listens well and internalize the best ideas from all over the organization while also setting a vision that others are excited to get behind</li> <li>You prefer simple solutions and designs over complex ones, and have a good intuition for what is lasting and scalable</li> <li>Thrive in a collaborative environment involving different stakeholders and subject matter experts</li> <li>Can put yourself in the shoes of your users and be a steward of crafting great developer and consumer experiences</li> </ul> <p>&nbsp;</p>",
            "apply_url": "https://stripe.com/jobs/search?gh_jid=7543559",
            "source": "greenhouse"
        },
        {
            "title": "Android Engineer, Terminal Global Payments",
            "company": "Stripe",
            "location": "San Francisco, CA, Seattle, WA",
            "remote": false,
            "description": "<h1>Who we are</h1> <h2>About Stripe</h2> <p>Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.</p> <h2>About the team</h2> <p>Stripe Terminal helps our users extend their online presence to the physical world. The Terminal team’s mission is to make it as easy for businesses to accept in-person payments as the Stripe API has done for online payments. Stripe was founded to make it easier for developers to accept payments. We’ve solved a small part of that problem, but our ambition is to go much further.&nbsp;</p> <p>&nbsp;</p> <p>Engineers on the Terminal Global Payments team will build and enhance the Payments Platform including the applications and services that run on the physical Terminal devices and expanding access to local payment methods, meeting our merchants/gateways where their in-person payment needs are. Your work will also include building the frameworks for other engineers, both internal and external to Stripe, to develop on our custom platform with ease. As part of this role, you will focus on working on Terminal’s Android devices and Tap to Pay development, with opportunities to build on Stripe’s backend payments infrastructure as well.</p> <h1>What you'll do</h1> <h2>Responsibilities</h2> <ul> <li>Design and develop end-to-end payment features spanning mobile<strong> </strong>applications, device-level integrations and backend services for a variety of devices and form factors.</li> <li>Collaborate closely with the Terminal backend and infrastructure teams to design and implement scalable payments solutions across Stripe's services.</li> <li>Support the development of mobile device testing infrastructure and automation frameworks&nbsp;</li> <li>Work with engineers, product managers, designers, and stakeholders across the company to deliver complete features.</li> <li>Work with user research and product design to understand users and address their needs across the full stack.</li> </ul> <h2>Who you are</h2> <p>We're looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.</p> <h2>Minimum requirements</h2> <ul> <li>Have a strong technical background, with 2+ years of experience, working with at least 1 of the following: Java, Kotlin, Swift, Objective-C</li> <li>Experience building and maintaining backend services or distributed systems.</li> <li>Demonstrated ability to work across the stack, from mobile clients to backend infrastructure.</li> </ul> <h2>Preferred qualifications</h2> <ul> <li>Experience with backend languages and frameworks (Ruby, Java, Golang, or similar)</li> <li>Payments expertise or domain knowledge</li> <li>Experience with SDKs, libraries, or developer-facing tools</li> </ul>",
            "apply_url": "https://stripe.com/jobs/search?gh_jid=7778627",
            "source": "greenhouse"
        },
        {
            "title": "Backend / API Engineer, Payouts",
            "company": "Stripe",
            "location": "United Kingdom",
            "remote": false,
            "description": "<h5><strong><em>Note:&nbsp;</em></strong><em>if you are an intern, new grad, or staff applicant, please do not apply using this link and visit our </em><a href=\"https://stripe.com/jobs/search\"><em>jobs page</em></a><em> for those specific postings.</em></h5> <h2><strong>Who we are</strong></h2> <h3><strong>About Stripe</strong></h3> <p>Stripe is a financial infrastructure platform for businesses. Millions of companies - from the world’s largest enterprises to the most ambitious startups - use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone's reach while doing the most important work of your career.</p> <h3><strong>About the Organization&nbsp;</strong></h3> <p>The Payments organization focuses on developing products and platforms that enable users to accept payments from customers efficiently. This includes building APIs for processing payments, enabling regional, non-card payment options, and extending Stripe's capabilities to make it easy for businesses to accept in-person payments. The Risk Engineering team develops products that minimize financial and regulatory risks while ensuring a seamless user experience, thereby safeguarding Stripe’s brand and financial stability.&nbsp;</p> <p><strong>Team Matching:</strong> exact team matching for one of the subteams within this org will begin during final stages.<em> Please note we may also consider you for different orgs based on your experience, location, etc. </em>More information on our team matching process can be found <a href=\"https://docs.google.com/document/d/15zZJetyRWG5DAk96sawRjLDI_cpN1QoeLMn4aAPCrYs/edit?tab=t.0#heading=h.x8j6mm75yuwr\">here</a>.&nbsp;</p> <h2><strong>What you’ll do</strong></h2> <p>We’re looking for Backend engineers who want to make an impact on managing money at a global scale with a passion for building ergonomic APIs. You’ll play a key role in extending our balance management platform and in building out a new funds accessibility platform leveraged by enterprises and SMBs alike. Our team collaborates with many cross-functional teams – from Infrastructure to Product –&nbsp; at Stripe to deliver innovative solutions that address evolving user needs.</p> <h3><strong>Responsibilities</strong></h3> <ul> <li>Scope, design, build, and maintain APIs, services, and large-scale systems that reliably and efficiently handle billions of money movement requests</li> <li>Debug and solve critical production issues across services and multiple levels of the stack</li> <li>Mentor engineers to help them grow</li> <li>Collaborate with stakeholders across the company to build new features at large-scale, while improving internal engineering standards, tooling, and processes</li> <li>Collaborate effectively in a distributed and hybrid team, maintaining open communication and strong connections with colleagues</li> </ul> <h2><strong>Who you are</strong></h2> <p>We're looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. 
The preferred qualifications are a bonus, not a requirement.</p> <h3><strong>Minimum requirements</strong></h3> <ul> <li>7-12+ years of industry software engineering experience (does not include internships nor includes co-ops)</li> <li>Strong coding skills in any programming language <em>(we understand new languages can be learned on the job so our interview process is language agnostic)</em></li> <li>Strong collaboration skills, can work across workstreams within your team and contribute to your peers’ success</li> <li>Have the ability to thrive on a high level of autonomy, responsibility, and think of yourself as entrepreneurial</li> <li>Interest in working as a generalist across varying technologies and stacks to solve problems and delight both internal and external users</li> </ul> <h3><strong>Preferred Qualifications</strong></h3> <ul> <li>Experience with large-scale financial tracking systems</li> <li>Good understanding and practical knowledge in cloud based services (e.g. gRPC, GraphQL, Docker/Kubernetes, cloud services such as AWS, etc.)&nbsp;</li> </ul>",
            "apply_url": "https://stripe.com/jobs/search?gh_jid=7369543",
            "source": "greenhouse"
        },
        {
            "title": "Data Scientist",
            "company": "Figma",
            "location": "San Francisco, CA • New York, NY • United States",
            "remote": false,
            "description": "<div class=\"content-intro\"><p>Figma is growing our team of passionate creatives and builders on a mission to make design accessible to all. Figma’s platform helps teams bring ideas to life—whether you're brainstorming, creating a prototype, translating designs into code, or iterating with AI. From idea to product, Figma empowers teams to streamline workflows, move faster, and work together in real time from anywhere in the world. If you're excited to shape the future of design and collaboration, join us!</p></div><div class=\"section page-centered\"> <p>We are looking for an experienced Data Scientist to join our growing data team. At Figma, Data Scientists are deeply embedded within cross-functional teams across the company—from Product to Finance, Marketing, and Platform. We are hiring across a number of roles in these areas. This is ideal for someone excited to own high-impact data projects, partner strategically with stakeholders, and shape the future of our products and business.</p> <p>The ideal candidate will bring strong analytical and technical skills, business intuition, and a collaborative mindset to guide decision-making and unlock growth opportunities. You’ll work on problems ranging from understanding user behavior to optimizing revenue strategies, improving internal tooling, and influencing product direction through experimentation and data modeling.</p> <p>This is a full-time role that can be held from one of our US hubs or remotely in the United States.</p> <h4>What you’ll do at Figma:</h4> <ul> <li>Collaborate across teams to define and measure key metrics, design experiments, and uncover insights that inform strategic decisions</li> <li>Build models and analytical frameworks to support product, marketing, platform, or finance initiatives</li> <li>Develop tools, datasets, and systems that enable others to work with data more efficiently and rigorously</li> <li>Own complex data projects end-to-end—from problem scoping to solution delivery</li> <li>Champion data quality, accessibility, and the democratization of data across the organization</li> <li>Partner with Product, Engineering, Design, Research, Sales, Marketing, or Finance to drive impact</li> </ul> <h4>We'd love to hear from you if you have:</h4> <ul> <li>4+ years of experience in Analytics, Data Science, or a related field</li> <li>Fluency in SQL and proficiency in a scripting language like Python or R</li> <li>Experience with distributed data systems (e.g., Redshift, Snowflake, Presto, Hive, Spark)</li> <li>Strong foundation in statistical methods, experimentation, and/or forecasting</li> <li>A track record of working cross-functionally and communicating effectively with both technical and non-technical partners</li> <li>Experience supporting one or more of the following: Product, Marketing, Finance, or internal Platform/Tooling teams</li> </ul> <h4><strong>While it’s not required, it’s an added plus if you also have:</strong></h4> <ul> <li>A self-starter attitude and the ability to thrive in ambiguous and fast-paced environments</li> </ul> <div class=\"section page-centered\"> <div> <div>At Figma, one of our values is Grow as you go. We believe in hiring smart, curious people who are excited to learn and develop their skills. If you’re excited about this role but your past experience doesn’t align perfectly with the points outlined in the job description, we encourage you to apply anyways. 
You may be just the right candidate for this or other roles.</div> </div> </div> </div><div class=\"content-pay-transparency\"><div class=\"pay-input\"><div class=\"description\"><p><strong><span style=\"font-size: 16px;\">Pay Transparency Disclosure</span></strong></p> <p>If based in Figma’s San Francisco or New York hub offices, this role has the annual base salary range stated below.&nbsp;&nbsp;&nbsp;&nbsp;</p> <p>Job level and actual compensation will be decided based on factors including, but not limited to, individual qualifications objectively assessed during the interview process (including skills and prior relevant experience, potential impact, and scope of role), market demands, and specific work location. The listed range is a guideline, and the range for this role may be modified. For roles that are available to be filled remotely, the pay range is localized according to employee work location by a factor of between 80% and 100% of range. Please discuss your specific work location with your recruiter for more information.&nbsp;</p> <p>Figma offers equity to employees, as well a competitive package of additional benefits, including health, dental &amp; vision, retirement with company contribution, parental leave &amp; reproductive or family planning support, mental health &amp; wellness benefits, generous PTO, company recharge days, a learning &amp; development stipend, a work from home stipend, and cell phone reimbursement.&nbsp; Figma also offers sales incentive pay for most sales roles and an annual bonus plan for eligible non-sales roles. Figma’s compensation and benefits are subject to change and may be modified in the future.</p></div><div class=\"title\">Annual Base Salary Range:</div><div class=\"pay-range\"><span>$140,000</span><span class=\"divider\">&mdash;</span><span>$348,000 USD</span></div></div></div><div class=\"content-conclusion\"><p>At Figma we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our product and our community to flourish. Figma is an <a href=\"https://www.eeoc.gov/sites/default/files/2022-10/EEOC_KnowYourRights_screen_reader_10_20.pdf\">equal opportunity workplace</a> - we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity/expression, veteran status<strong>, </strong>or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.</p> <p>We will work to ensure individuals with disabilities are provided reasonable accommodation to apply for a role, participate in the interview process, perform essential job functions, and receive other benefits and privileges of employment. If you require accommodation, please reach out to <a href=\"mailto:accommodations-ext@figma.com\">accommodations-ext@figma.com</a>. 
These modifications enable an individual with a disability to have an equal opportunity not only to get a job, but successfully perform their job tasks to the same extent as people without disabilities.&nbsp;</p> <p>Examples of accommodations include but are not limited to:&nbsp;</p> <ul> <li>Holding interviews in an accessible location</li> <li>Enabling closed captioning on video conferencing</li> <li>Ensuring all written communication be compatible with screen readers</li> <li>Changing the mode or format of interviews&nbsp;</li> </ul> <p>To ensure the integrity of our hiring process and facilitate a more personal connection, we require all candidates keep their cameras on during video interviews. Additionally, if hired you will be required to attend in person onboarding.</p> <p>By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with <a class=\"c-link c-link--underline\" href=\"https://www.figma.com/legal/candidate-privacy-notice/\" target=\"_blank\" data-stringify-link=\"https://www.figma.com/legal/candidate-privacy-notice/\" data-sk=\"tooltip_parent\">Figma's Candidate Privacy Notice</a>.</p></div>",
            "apply_url": "https://boards.greenhouse.io/figma/jobs/5552580004?gh_jid=5552580004",
            "source": "greenhouse"
        },
        {
            "title": "Data Scientist, Core Data -  PhD (2026)",
            "company": "Figma",
            "location": "New York, NY • United States; San Francisco, CA • New York, NY",
            "remote": false,
            "description": "<div class=\"content-intro\"><p>Figma is growing our team of passionate creatives and builders on a mission to make design accessible to all. Figma’s platform helps teams bring ideas to life—whether you're brainstorming, creating a prototype, translating designs into code, or iterating with AI. From idea to product, Figma empowers teams to streamline workflows, move faster, and work together in real time from anywhere in the world. If you're excited to shape the future of design and collaboration, join us!</p></div><p>We're looking for a research-minded Data Scientist to join the Core Data team. This team is a group of analytics professionals and Engineers building the foundational platforms for data science at Figma.&nbsp; We build the experimentation, analytics, and AI tooling that every product team relies on to make confident, data-driven decisions, partnering closely with Data Infra, ML, and Applied Science to evolve our platforms and embed AI into the daily workflows of data scientists across the company.</p> <p>This role is for someone who thrives at the intersection of rigorous research and real-world impact. You'll bring PhD-level depth to problems that matter. This includes advancing our experimentation platform and developing machine learning-based analytical systems. You will also help craft how we measure AI-powered features through causal inference and statistical modeling.&nbsp;&nbsp;</p> <p>This is a full time role that can be held from one of our US hubs or remotely in the United States.&nbsp;</p> <p><strong>What you'll do at Figma:</strong></p> <ul> <li>Partner across teams to define and track important metrics, develop experiments, and uncover insights that inform strategic decisions&nbsp;&nbsp;</li> <li>Accelerate Figma's experimentation platform and methodology, including A/B testing frameworks and causal inference techniques</li> <li>Construct models and analytical frameworks based on machine learning to support product, platform, and business initiatives&nbsp;&nbsp;</li> <li>Create tools, datasets, and systems that enable others to work with data more efficiently and rigorously</li> <li>Complete and own complex data projects end-to-end, from problem prioritisation to solution delivery&nbsp;&nbsp;</li> <li>Drive data quality, accessibility, and the democratization of data across the organization</li> </ul> <p><strong>We’d love to hear from you if you have:</strong></p> <ul> <li>PhD in a quantitative field (Statistics, Computer Science, Economics, Operations Research, Physics, or related) with a strong foundation in statistical methods, experimentation, and/or machine learning</li> <li>Fluency in SQL and proficiency in a scripting language like Python or R, with exposure to distributed data systems (e.g. 
Snowflake) through research or internships</li> <li>Ability to communicate technical concepts clearly to both technical and non-technical audiences</li> <li>A curious and rigorous mindset, with a passion for translating research into real-world impact</li> </ul> <p><strong>While it’s not required, it’s an added plus if you also have:</strong></p> <ul> <li>Publications or research experience in experimentation or applied ML; industry internship experience applying data science to product or business problems</li> <li>An AI-native mindset, with exposure to or interest in LLM analytics, AI product measurement, or evaluating the impact of AI-powered features</li> <li>A self-starter attitude and the ability to thrive in ambiguous and fast-paced environments</li> </ul> <div class=\"section page-centered\"> <div> <div class=\"section page-centered\"> <div> <div>At Figma, one of our values is Grow as you go. We believe in hiring smart, curious people who are excited to learn and develop their skills. If you’re excited about this role but your past experience doesn’t align perfectly with the points outlined in the job description, we encourage you to apply anyways. You may be just the right candidate for this or other roles.</div> </div> </div> </div> </div><div class=\"content-pay-transparency\"><div class=\"pay-input\"><div class=\"description\"><p><strong><span style=\"font-size: 16px;\">Pay Transparency Disclosure</span></strong></p> <p>If based in Figma’s San Francisco or New York hub offices, this role has the annual base salary range stated below.&nbsp;&nbsp;&nbsp;&nbsp;</p> <p>Job level and actual compensation will be decided based on factors including, but not limited to, individual qualifications objectively assessed during the interview process (including skills and prior relevant experience, potential impact, and scope of role), market demands, and specific work location. The listed range is a guideline, and the range for this role may be modified. For roles that are available to be filled remotely, the pay range is localized according to employee work location by a factor of between 80% and 100% of range. Please discuss your specific work location with your recruiter for more information.&nbsp;</p> <p>Figma offers equity to employees, as well a competitive package of additional benefits, including health, dental &amp; vision, retirement with company contribution, parental leave &amp; reproductive or family planning support, mental health &amp; wellness benefits, generous PTO, company recharge days, a learning &amp; development stipend, a work from home stipend, and cell phone reimbursement.&nbsp; Figma also offers sales incentive pay for most sales roles and an annual bonus plan for eligible non-sales roles. Figma’s compensation and benefits are subject to change and may be modified in the future.</p></div><div class=\"title\">Annual Base Salary Range:</div><div class=\"pay-range\"><span>$170,000</span><span class=\"divider\">&mdash;</span><span>$178,000 USD</span></div></div></div><div class=\"content-conclusion\"><p>At Figma we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our product and our community to flourish. 
Figma is an <a href=\"https://www.eeoc.gov/sites/default/files/2022-10/EEOC_KnowYourRights_screen_reader_10_20.pdf\">equal opportunity workplace</a> - we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity/expression, veteran status<strong>, </strong>or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.</p> <p>We will work to ensure individuals with disabilities are provided reasonable accommodation to apply for a role, participate in the interview process, perform essential job functions, and receive other benefits and privileges of employment. If you require accommodation, please reach out to <a href=\"mailto:accommodations-ext@figma.com\">accommodations-ext@figma.com</a>. These modifications enable an individual with a disability to have an equal opportunity not only to get a job, but successfully perform their job tasks to the same extent as people without disabilities.&nbsp;</p> <p>Examples of accommodations include but are not limited to:&nbsp;</p> <ul> <li>Holding interviews in an accessible location</li> <li>Enabling closed captioning on video conferencing</li> <li>Ensuring all written communication be compatible with screen readers</li> <li>Changing the mode or format of interviews&nbsp;</li> </ul> <p>To ensure the integrity of our hiring process and facilitate a more personal connection, we require all candidates keep their cameras on during video interviews. Additionally, if hired you will be required to attend in person onboarding.</p> <p>By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with <a class=\"c-link c-link--underline\" href=\"https://www.figma.com/legal/candidate-privacy-notice/\" target=\"_blank\" data-stringify-link=\"https://www.figma.com/legal/candidate-privacy-notice/\" data-sk=\"tooltip_parent\">Figma's Candidate Privacy Notice</a>.</p></div>",
            "apply_url": "https://boards.greenhouse.io/figma/jobs/5976930004?gh_jid=5976930004",
            "source": "greenhouse"
        },
        {
            "title": "Developer Advocate",
            "company": "Figma",
            "location": "San Francisco, CA",
            "remote": false,
            "description": "<div class=\"content-intro\"><p>Figma is growing our team of passionate creatives and builders on a mission to make design accessible to all. Figma’s platform helps teams bring ideas to life—whether you're brainstorming, creating a prototype, translating designs into code, or iterating with AI. From idea to product, Figma empowers teams to streamline workflows, move faster, and work together in real time from anywhere in the world. If you're excited to shape the future of design and collaboration, join us!</p></div><p>We're looking for a Developer Advocate to join our AMER Advocacy team, based in San Francisco. This advocate will support content creation, and community building through field events &amp; livestreams, go-to-market motions, product initiatives, and 1:1 customer engagements with our sales team. Advocates at Figma are practitioners first — people who understand the complexity of modern product development firsthand and can advocate for user needs internally.</p> <p>This is a hands-on role that blends technical depth, communication, and strong product intuition. You'll write production-quality code, build demos and tools, and directly support developers and technical teams. You'll also operate as a connective layer across functions — helping translate between customer needs, product decisions, and go-to-market execution.</p> <p>This role is available in the SF / Bay Area and is remote-friendly. You should be able to commute to the San Francisco office on a regular basis when needed. This role would require travel up to 25% of the time.&nbsp;</p> <h4>What you’ll do at Figma:</h4> <ul> <li> <ul> <li> <ul> <li><strong>Partner deeply with Sales and GTM</strong>: Join customer engagements, unblock technical teams, and support high-impact deals involving Dev Mode, design systems, and the Figma API.</li> <li><strong>Own regional technical presence</strong>: Play a leading role across AMER field events, activations, and key moments, representing Figma to developer audiences.</li> <li><strong>Build real things</strong>: Create production-quality demos, prototypes, and tools that reflect how modern teams actually build, not just toy examples.</li> <li><strong>Create scalable technical content</strong>: Author guides, code samples, and technical narratives that help developers adopt Figma in real workflows.</li> <li><strong>Influence product direction</strong>: Bring structured, high-signal feedback from customers and the community into Product. Help shape features before and after launch.</li> <li><strong>Bridge design and development</strong>: Help teams operationalize workflows across design systems, tokens, and developer tooling.</li> </ul> </li> </ul> </li> </ul> <div class=\"section page-centered\"> <p><strong>We'd love to hear from you if you:</strong></p> <ul> <li>8+ years of experience writing production code, with&nbsp; JavaScript/TypeScript and React; knowledge of Native languages is a nice to have.</li> <li>Are an excellent storyteller and communicator; you can tailor complex concepts to a variety of audiences</li> <li>Have experience working alongside designers or building tools that support design systems.</li> <li>Have strong product instincts; you understand not just how to build, but what to build and why.</li> <li>Are confident speaking publicly and able to engage large audiences can operate autonomously and love working cross-functionally across many different teams.</li> </ul> <p>At Figma, one of our values is Grow as you go. 
We believe in hiring smart, curious people who are excited to learn and develop their skills. If you’re excited about this role but your past experience doesn’t align perfectly with the points outlined in the job description, we encourage you to apply anyways. You may be just the right candidate for this or other roles.</p> </div><div class=\"content-pay-transparency\"><div class=\"pay-input\"><div class=\"description\"><p><strong><span style=\"font-size: 16px;\">Pay Transparency Disclosure</span></strong></p> <p>If based in Figma’s San Francisco or New York hub offices, this role has the annual base salary range stated below.&nbsp;&nbsp;&nbsp;&nbsp;</p> <p>Job level and actual compensation will be decided based on factors including, but not limited to, individual qualifications objectively assessed during the interview process (including skills and prior relevant experience, potential impact, and scope of role), market demands, and specific work location. The listed range is a guideline, and the range for this role may be modified. For roles that are available to be filled remotely, the pay range is localized according to employee work location by a factor of between 80% and 100% of range. Please discuss your specific work location with your recruiter for more information.&nbsp;</p> <p>Figma offers equity to employees, as well a competitive package of additional benefits, including health, dental &amp; vision, retirement with company contribution, parental leave &amp; reproductive or family planning support, mental health &amp; wellness benefits, generous PTO, company recharge days, a learning &amp; development stipend, a work from home stipend, and cell phone reimbursement.&nbsp; Figma also offers sales incentive pay for most sales roles and an annual bonus plan for eligible non-sales roles. Figma’s compensation and benefits are subject to change and may be modified in the future.</p></div><div class=\"title\">Annual Base Salary Range:</div><div class=\"pay-range\"><span>$153,000</span><span class=\"divider\">&mdash;</span><span>$317,000 USD</span></div></div></div><div class=\"content-conclusion\"><p>At Figma we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our product and our community to flourish. Figma is an <a href=\"https://www.eeoc.gov/sites/default/files/2022-10/EEOC_KnowYourRights_screen_reader_10_20.pdf\">equal opportunity workplace</a> - we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity/expression, veteran status<strong>, </strong>or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.</p> <p>We will work to ensure individuals with disabilities are provided reasonable accommodation to apply for a role, participate in the interview process, perform essential job functions, and receive other benefits and privileges of employment. If you require accommodation, please reach out to <a href=\"mailto:accommodations-ext@figma.com\">accommodations-ext@figma.com</a>. 
These modifications enable an individual with a disability to have an equal opportunity not only to get a job, but successfully perform their job tasks to the same extent as people without disabilities.&nbsp;</p> <p>Examples of accommodations include but are not limited to:&nbsp;</p> <ul> <li>Holding interviews in an accessible location</li> <li>Enabling closed captioning on video conferencing</li> <li>Ensuring all written communication be compatible with screen readers</li> <li>Changing the mode or format of interviews&nbsp;</li> </ul> <p>To ensure the integrity of our hiring process and facilitate a more personal connection, we require all candidates keep their cameras on during video interviews. Additionally, if hired you will be required to attend in person onboarding.</p> <p>By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with <a class=\"c-link c-link--underline\" href=\"https://www.figma.com/legal/candidate-privacy-notice/\" target=\"_blank\" data-stringify-link=\"https://www.figma.com/legal/candidate-privacy-notice/\" data-sk=\"tooltip_parent\">Figma's Candidate Privacy Notice</a>.</p></div>",
            "apply_url": "https://boards.greenhouse.io/figma/jobs/5834922004?gh_jid=5834922004",
            "source": "greenhouse"
        },
        {
            "title": "Developer Advocate (Tokyo, Japan)",
            "company": "Figma",
            "location": "Tokyo, Japan",
            "remote": false,
            "description": "<div class=\"content-intro\"><p>Figma is growing our team of passionate creatives and builders on a mission to make design accessible to all. Figma’s platform helps teams bring ideas to life—whether you're brainstorming, creating a prototype, translating designs into code, or iterating with AI. From idea to product, Figma empowers teams to streamline workflows, move faster, and work together in real time from anywhere in the world. If you're excited to shape the future of design and collaboration, join us!</p></div><p>We're looking for a Developer Advocate to join our JAPAC Advocacy team, based in Tokyo. This advocate will support content creation, and community building through field events &amp; livestreams, go-to-market motions, product initiatives, and 1:1 customer engagements with our sales team. Advocates at Figma are practitioners first - people who understand the complexity of modern product development firsthand and can advocate for user needs internally.</p> <p>This is a hands-on role that blends technical depth, communication, and strong product intuition. You'll write production-quality code, build demos and tools, and directly support developers and technical teams. You'll also operate as a connective layer across functions - helping translate between customer needs, product decisions, and go-to-market execution.</p> <p>This is a full time role in our Tokyo office, within a hybrid environment.</p> <h4>What you’ll do at Figma:</h4> <div class=\"section page-centered\"> <ul> <li>Partner deeply with Sales and GTM: Join customer engagements, unblock technical teams, and support high-impact deals involving Dev Mode, design systems, and the Figma API</li> <li>Own regional technical presence: Play a leading role across JAPAC field events, activations, and key moments, representing Figma to developer audiences</li> <li>Build real things: Create production-quality demos, prototypes, and tools that reflect how modern teams actually build, not just toy examples</li> <li>Create scalable technical content: Author guides, code samples, and technical narratives that help developers adopt Figma in real workflows</li> <li>Influence product direction: Bring structured, high-signal feedback from customers and the community into Product. Help shape features before and after launch</li> <li>Bridge design and development: Help teams operationalize workflows across design systems, tokens, and developer tooling</li> </ul> </div> <div class=\"section page-centered\"> <h4>We'd love to hear from you if you have:</h4> <ul> <li>5+ years technical front-end knowledge (should be committing code to a production level environment, not self taught alone)</li> <li>English fluency (+ ideally one other language)</li> <li>Strong written and verbal communication skills</li> <li>Experience in creating technical content/delivering talks</li> <li>Experience in product development workflows especially collaborating with design teams</li> <li>Background in growth SaaS business</li> </ul> <h4><strong>While it’s not required, it’s an added plus if you also have:</strong></h4> <ul> <li>Design experience</li> <li>Knowledgeable about design team + design systems workflows</li> <li>Experience developing (Figma) plugins</li> </ul> <div class=\"section page-centered\"> <div> <div>At Figma, one of our values is Grow as you go. We believe in hiring smart, curious people who are excited to learn and develop their skills. 
If you’re excited about this role but your past experience doesn’t align perfectly with the points outlined in the job description, we encourage you to apply anyways. You may be just the right candidate for this or other roles.</div> </div> </div> </div><div class=\"content-conclusion\"><p>At Figma we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our product and our community to flourish. Figma is an <a href=\"https://www.eeoc.gov/sites/default/files/2022-10/EEOC_KnowYourRights_screen_reader_10_20.pdf\">equal opportunity workplace</a> - we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity/expression, veteran status<strong>, </strong>or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.</p> <p>We will work to ensure individuals with disabilities are provided reasonable accommodation to apply for a role, participate in the interview process, perform essential job functions, and receive other benefits and privileges of employment. If you require accommodation, please reach out to <a href=\"mailto:accommodations-ext@figma.com\">accommodations-ext@figma.com</a>. These modifications enable an individual with a disability to have an equal opportunity not only to get a job, but successfully perform their job tasks to the same extent as people without disabilities.&nbsp;</p> <p>Examples of accommodations include but are not limited to:&nbsp;</p> <ul> <li>Holding interviews in an accessible location</li> <li>Enabling closed captioning on video conferencing</li> <li>Ensuring all written communication be compatible with screen readers</li> <li>Changing the mode or format of interviews&nbsp;</li> </ul> <p>To ensure the integrity of our hiring process and facilitate a more personal connection, we require all candidates keep their cameras on during video interviews. Additionally, if hired you will be required to attend in person onboarding.</p> <p>By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with <a class=\"c-link c-link--underline\" href=\"https://www.figma.com/legal/candidate-privacy-notice/\" target=\"_blank\" data-stringify-link=\"https://www.figma.com/legal/candidate-privacy-notice/\" data-sk=\"tooltip_parent\">Figma's Candidate Privacy Notice</a>.</p></div>",
            "apply_url": "https://boards.greenhouse.io/figma/jobs/5969647004?gh_jid=5969647004",
            "source": "greenhouse"
        },
        {
            "title": "IT Engineer (London, United Kingdom)",
            "company": "Figma",
            "location": "London, England",
            "remote": false,
            "description": "<div class=\"content-intro\"><p>Figma is growing our team of passionate creatives and builders on a mission to make design accessible to all. Figma’s platform helps teams bring ideas to life—whether you're brainstorming, creating a prototype, translating designs into code, or iterating with AI. From idea to product, Figma empowers teams to streamline workflows, move faster, and work together in real time from anywhere in the world. If you're excited to shape the future of design and collaboration, join us!</p></div><div class=\"section page-centered\"> <p>As a member of the IT Engineering team, you’ll collaborate closely with IT Operations, Security, and cross-functional partners to develop, manage, and secure Figma’s internal IT services and employee device experience. This role is primarily focused on endpoint management and security posture—especially for macOS—with a strong emphasis on automation, reliable software delivery, and configuration-as-code practices.</p> <p>You’ll partner across IT and Security to design and run repeatable endpoint workflows that keep devices secure, compliant, and easy to support.</p> <h4><strong>What you'll do at Figma:</strong></h4> <ul> <li>Contribute to the ongoing management and improvement of our macOS endpoint program: provisioning, enrollment, configuration, compliance, patching, troubleshooting, and deprovisioning</li> <li>Build and maintain software deployment and update workflows with safe rollout patterns (pilot → staged → broad), measurable success criteria, and clear rollback plans</li> <li>Develop automation using Bash/Python, APIs, and Git-based workflows to reduce repetitive work and improve reliability (e.g., lifecycle tasks, reporting, drift detection/remediation, self-service enablement)</li> <li>Implement and operationalize endpoint security controls in partnership with Security (secure configuration baselines, permissions/PPPC/TCC strategy, posture validation concepts, response playbooks)</li> <li>Improve operational rigor: documentation, runbooks, change management, and incident follow-through/retrospectives</li> <li>Communicate endpoint changes clearly to impacted audiences (what’s changing, why, what users might see, and how to get help)</li> <li>Work in a “configuration as code” mindset where applicable: PR-based changes, peer review, and traceable deployments using tools like GitHub, Terraform, YAML, or similar</li> <li>Collaborate effectively on office connectivity initiatives by providing working familiarity with Meraki (cloud-managed networking concepts and dashboard fundamentals) and coordinating with internal partners and external providers when needed</li> </ul> <h4>We’d love to hear from you if you have:</h4> <ul> <li>Significant hands-on experience managing macOS endpoints in an enterprise environment (typically 5+ years, or equivalent depth of responsibility)</li> <li>Strong experience administering a modern MDM / endpoint management platform (policies/profiles, packaging/software deployment, enrollment flows, scoping strategies, troubleshooting). Experience with tools such as Jamf Pro, Fleet, Kandji, Intune, Workspace ONE, or similar. 
Solid understanding of macOS security and management fundamentals (MDM concepts, certificates, PPPC/TCC, OS updates, compliance posture, IDE management)</li> <li>Proficiency in Bash and/or Python, plus comfort working with APIs, logs, and structured data</li> <li>Comfortable with GitOps/configuration-as-code workflows (GitHub, Terraform/YAML, CI-friendly change management)</li> <li>Working familiarity with Meraki and cloud-managed networking concepts (enough to partner effectively with specialists/vendors, not to be the dedicated network owner)</li> </ul> <h4>While it’s not required, it’s an added plus if you also have:</h4> <ul> <li>Experience with identity-adjacent endpoint controls (device posture/device trust concepts; integrations with IdPs such as Okta)</li> <li>Familiarity with endpoint visibility/telemetry tooling and fleet reporting (query-based inventory, EDR/SIEM integrations)</li> <li>Demonstrated proficiency in improving or modernizing endpoint management programs (tooling evaluation, rollout strategy, change management) with minimal end-user disruption</li> <li>Experience operating in a global environment with distributed offices and vendor-supported onsite infrastructure</li> <li>Exposure to managing configurations for Chrome and Android through Google Workspace.</li> </ul> <div class=\"section page-centered\"> <div> <div>At Figma, one of our values is Grow as you go. We believe in hiring smart, curious people who are excited to learn and develop their skills. If you’re excited about this role but your past experience doesn’t align perfectly with the points outlined in the job description, we encourage you to apply anyways. You may be just the right candidate for this or other roles.</div> </div> </div> </div><div class=\"content-conclusion\"><p>At Figma we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our product and our community to flourish. Figma is an <a href=\"https://www.eeoc.gov/sites/default/files/2022-10/EEOC_KnowYourRights_screen_reader_10_20.pdf\">equal opportunity workplace</a> - we are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity/expression, veteran status<strong>, </strong>or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.</p> <p>We will work to ensure individuals with disabilities are provided reasonable accommodation to apply for a role, participate in the interview process, perform essential job functions, and receive other benefits and privileges of employment. If you require accommodation, please reach out to <a href=\"mailto:accommodations-ext@figma.com\">accommodations-ext@figma.com</a>. 
These modifications enable an individual with a disability to have an equal opportunity not only to get a job, but successfully perform their job tasks to the same extent as people without disabilities.&nbsp;</p> <p>Examples of accommodations include but are not limited to:&nbsp;</p> <ul> <li>Holding interviews in an accessible location</li> <li>Enabling closed captioning on video conferencing</li> <li>Ensuring all written communication be compatible with screen readers</li> <li>Changing the mode or format of interviews&nbsp;</li> </ul> <p>To ensure the integrity of our hiring process and facilitate a more personal connection, we require all candidates keep their cameras on during video interviews. Additionally, if hired you will be required to attend in person onboarding.</p> <p>By applying for this job, the candidate acknowledges and agrees that any personal data contained in their application or supporting materials will be processed in accordance with <a class=\"c-link c-link--underline\" href=\"https://www.figma.com/legal/candidate-privacy-notice/\" target=\"_blank\" data-stringify-link=\"https://www.figma.com/legal/candidate-privacy-notice/\" data-sk=\"tooltip_parent\">Figma's Candidate Privacy Notice</a>.</p></div>",
            "apply_url": "https://boards.greenhouse.io/figma/jobs/5813865004?gh_jid=5813865004",
            "source": "greenhouse"
        }
    ]
}

TODOs

  • Reduce end-to-end ingest latency (currently ~15s).
  • Clean the HTML out of the job descriptions in the JSON response (see the sketch below).
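
A minimal sketch of the second item, assuming BeautifulSoup (beautifulsoup4) is an acceptable dependency; the description field mirrors the sample payload above:

from bs4 import BeautifulSoup

def clean_description(raw_html: str) -> str:
    """Strip tags from an HTML job description, returning plain text."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # separator=" " keeps words from adjacent tags apart; strip=True trims edges
    return soup.get_text(separator=" ", strip=True)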

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced CVPilot Job Scraper microservice with plugin-based architecture for fetching and filtering jobs
    • Added /health endpoint to monitor service status and available job sources
    • Added /internal/ingest endpoint to fetch, normalize, and rank jobs by relevance with optional filtering by sources, companies, and limits (a usage sketch follows these notes)
    • Implemented intelligent job filtering pipeline supporting user preferences (skills, preferred roles, location, remote-only preference)
    • Added Greenhouse job board integration as primary source
    • Integrated structured JSON logging for all operations
    • Enabled rate limiting, request retries, and resilient HTTP handling
  • Documentation

    • Comprehensive service documentation with API examples and configuration guide
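
A rough usage sketch for the ingest endpoint (host, port, and exact payload fields are assumptions based on the notes above, not a confirmed contract):

import requests

resp = requests.post(
    "http://localhost:8000/internal/ingest",
    json={
        "sources": ["greenhouse"],   # optional: restrict job sources
        "companies": ["stripe"],     # optional: restrict companies
        "limit_per_company": 5,      # optional: cap results
    },
    timeout=30,
)
data = resp.json()
print(data["total"], "jobs returned")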

@coderabbitai

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

This PR introduces CVPilot Job Scraper, a production-ready FastAPI microservice with a plugin-based source registry, a 5-stage relevance-filtering pipeline, async job fetching with resilience, and structured JSON logging.

Changes

Complete Job Scraper Microservice

Layer / File(s) Summary
Type & Data Models
scrapper/models/job_schema.py, scrapper/utils/exceptions.py, scrapper/service/scoring.py
JobData, IngestionRequest/Response Pydantic models, custom exception hierarchy (ScraperException, SourceException, ConfigException, NetworkException, ValidationException), and ScoringConfig/JobScore/FilterResult dataclasses.
Configuration & Infrastructure
scrapper/config/loader.py, scrapper/utils/http_client.py, scrapper/utils/logger.py
Configuration loaders for companies.json and environment variables, async HttpClient with retry/backoff and rate limiting, structured JSON logging with JsonFormatter and lifecycle helpers.
Base Abstractions
scrapper/sources/base.py
Abstract JobSource interface defining source_name property, async fetch_jobs, and normalize_job contract.
Core Pipeline Logic
scrapper/service/scoring.py, scrapper/service/job_filter.py
Five-stage job filtering and ranking pipeline: cheap pre-filtering, weighted relevance scoring with component breakdown, dynamic threshold filtering, sorting by score, and top-K truncation. Includes helpers for keyword extraction, role/skill matching, and location/remote preference logic. A minimal sketch of these stages follows this table.
Source Implementations
scrapper/sources/greenhouse.py
GreenhouseSource fetches jobs via Greenhouse API with list caching, scoring-based ranking, parallel detail fetching, and normalization to JobData including HTML cleaning and remote detection.
Source Registry & Package Setup
scrapper/sources/__init__.py, scrapper/service/__init__.py, scrapper/config/__init__.py, scrapper/api/__init__.py, scrapper/__init__.py, scrapper/utils/__init__.py
SourceRegistry factory for plugin registration/retrieval, module-level re-exports and __all__ definitions, package initialization.
API Routes & Application
scrapper/api/routes.py, scrapper/main.py
FastAPI routes for GET /health (service status and available sources) and POST /internal/ingest (orchestrates config loading, per-source/company async fetching, pipeline execution, result limiting, timing/logging). Main app with lifespan context manager, CORS middleware, exception handler, and Uvicorn startup.
Configuration & Requirements
scrapper/companies.json, scrapper/requirements.txt, scrapper/.gitignore, scrapper/conftest.py
Company mappings for Greenhouse, pinned Python dependencies, Python ignore rules, and pytest path setup.
Documentation & Tests
IMPLEMENTATION_SUMMARY.md, scrapper/README.md, scrapper/sources/README.md, scrapper/tests/conftest.py, scrapper/tests/test_api.py, scrapper/tests/test_filtering.py, scrapper/tests/test_greenhouse.py, scrapper/tests/test_sources.py, scrapper/tests/__init__.py
Complete implementation architecture and design documentation, quickstart and API guide, source integration guide, pytest fixtures (mocked HTTP client, cache isolation, config), and test suites for API endpoints, filtering pipeline, Greenhouse source, and registry.
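
To make the five pipeline stages concrete, here is a self-contained sketch; the names, weights, and scoring rule are illustrative assumptions, not the PR's actual code:

from dataclasses import dataclass

@dataclass
class ScoredJob:
    job: dict
    score: int

def filter_and_rank_jobs(jobs, keywords, limit=None, threshold=1):
    # Stage 1: cheap pre-filter, dropping postings without a title
    candidates = [j for j in jobs if j.get("title")]
    # Stage 2: weighted scoring; here, simple keyword hits in the title
    scored = [
        ScoredJob(j, sum(kw in j["title"].lower() for kw in keywords))
        for j in candidates
    ]
    # Stage 3: threshold filter
    kept = [s for s in scored if s.score >= threshold]
    # Stage 4: sort by score, descending
    kept.sort(key=lambda s: s.score, reverse=True)
    # Stage 5: top-K truncation ("is not None" so limit=0 returns [])
    return kept if limit is None else kept[:limit]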

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant API as FastAPI<br/>Ingest Route
    participant Loader as Config<br/>Loader
    participant Registry as Source<br/>Registry
    participant Greenhouse as Greenhouse<br/>Source
    participant Pipeline as Filtering<br/>Pipeline
    participant Response as Response

    Client->>API: POST /internal/ingest<br/>(sources?, companies?, limit_per_company?)
    
    API->>Loader: load_companies()
    Loader-->>API: {greenhouse: [stripe, ...]}
    
    API->>Registry: get("greenhouse")
    Registry-->>API: GreenhouseSource()
    
    API->>Greenhouse: fetch_jobs(company=stripe, ...)
    Greenhouse->>Greenhouse: list + cache<br/>(get cached job list)
    Greenhouse->>Greenhouse: score_rank<br/>(filter/sort candidates)
    Greenhouse->>Greenhouse: fetch details<br/>(parallel async requests)
    Greenhouse-->>API: [{raw_job...}, ...]
    
    API->>Greenhouse: normalize_job(raw_job)<br/>(per job)
    Greenhouse-->>API: JobData{title, company, ...}
    
    API->>Pipeline: filter_and_rank_jobs<br/>(jobs, user_context?, limit)
    Pipeline->>Pipeline: Stage 1: cheap_filter
    Pipeline->>Pipeline: Stage 2: score_job
    Pipeline->>Pipeline: Stage 3: threshold_filter
    Pipeline->>Pipeline: Stage 4: sort by score desc
    Pipeline->>Pipeline: Stage 5: top-K slice
    Pipeline-->>API: {total, jobs[], pipeline_summary}
    
    API-->>Client: IngestionResponse{total, jobs[]}

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Whiskers twitching with delight,
A scraper hops from source to site!
Five stages filter, rank, refine,
Greenhouse jobs now brightly shine.
With async grace and tests so bright,
The burrow's work is done just right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check: ❓ Inconclusive. The title "built custom py scrapper" is vague and generic, using non-descriptive terms that don't convey the actual scope or purpose of the changeset. Resolution: replace it with a more specific title, such as "Add CVPilot Job Scraper microservice with Greenhouse integration" or "Implement job scraper with filtering pipeline and API endpoints".

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 95.50%, which meets the required threshold of 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@pullsharksite

pullsharksite Bot commented May 2, 2026

🦈 PullShark AI Analysis

Risk Level: 🔴 High

🧪 Recommended Tests

  • Functional Verification: Verify successful scraping and accurate data extraction from a diverse set of target job boards.
  • Data Integrity Testing: Validate all expected data fields (title, company, location, description, URL, etc.) are correctly extracted and formatted, handling missing or null values gracefully.
  • Pagination & Navigation Testing: Test different pagination types and infinite scroll mechanisms on various websites.
  • Error Handling & Resilience: Test scraper behavior when target websites are down, return 404/500 errors, or change their HTML structure.
  • Rate Limiting & Throttling: Verify the scraper's ability to respect website rate limits and gracefully handle temporary blocks or retries.
  • Performance & Scalability: Measure scraping speed, resource utilization (CPU, memory, network), and ensure stability under high load or long-running operations.
  • Data Sanitization: Confirm that all scraped data, especially free-text fields like job descriptions, is properly sanitized before storage or display to prevent injection attacks (e.g., XSS).
  • Logging & Monitoring: Verify that errors, warnings, and successful operations are logged appropriately, and relevant metrics are available for monitoring.
  • Negative Testing: Attempt to scrape invalid URLs, non-job pages, or pages with intentionally malformed content.
  • Integration Testing: Ensure seamless data flow and correct interaction with any downstream parsing engines, databases, or APIs.

⚠️ Edge Cases & Security

  • Target website HTML structure changes unexpectedly.
  • Website implements CAPTCHA or sophisticated bot detection mechanisms.
  • Website rate-limits or blocks scraper IP addresses.
  • Job listings contain malformed HTML or missing expected data fields.
  • Pagination schemes vary (e.g., next/previous buttons, infinite scroll, 'load more' buttons).
  • Dynamic content loading via JavaScript (if not handled by the scraper).
  • Very large number of job listings on a single page or across multiple pages affecting performance/memory.
  • Scraping from different geographic regions or languages (if applicable).
  • Job postings require login or specific session management.
  • Network interruptions or target website unavailability during scraping.
  • Potential for scraper to cause Denial of Service (DoS) to target websites due to aggressive requests.
  • Risk of IP blocking or legal action from target websites if scraping practices are deemed abusive.
  • If authentication is used, potential for credential leakage or insecure handling.
  • Risk of scraping malicious content (e.g., XSS in job descriptions) that could impact downstream applications if not properly sanitized.
  • Vulnerabilities in third-party Python libraries used by the scraper.
  • Exposure of internal infrastructure details if error logging is too verbose or unhandled exceptions occur.

Generated by PullShark AI


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 21

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@IMPLEMENTATION_SUMMARY.md`:
- Around line 6-7: The Test Coverage headline in IMPLEMENTATION_SUMMARY.md
("Test Coverage: 46/46 tests PASSING") is inconsistent with the listed
file-level totals; update the document so the headline total matches the sum of
the individual file totals (or vice versa) by recalculating and editing the
"Test Coverage" line and the per-file counts, and also reconcile the same
mismatch noted around lines 1037-1042; ensure the final numeric total and the
numerator/denominator values are consistent across the summary.

In `@scrapper/api/routes.py`:
- Around line 99-100: The handler collects an errors list but never uses it to
determine response status, so calls like the route function that populate errors
and jobs (look for the variables errors and jobs in the route handler) can
return 200 even if all source fetches failed; fix by checking errors after
processing: if len(errors) == len(companies) (i.e., no successful jobs)
return an error status (e.g., 500 or 400) with the errors payload; if some
succeeded and some failed return a 207 Multi-Status (or 200 with a clear
partial-success payload) including both jobs and errors; update all similar
blocks that populate errors/jobs (the other occurrences you flagged) to
implement this logic and replace the unconditional return of jobs with a
conditional return that sets the appropriate status and includes errors.
- Around line 165-179: The code is treating request.limit_per_company as a
global cap on merged results; instead preserve it as a per-company cap and
enforce it during per-company selection/merging. Remove the global application
of result_limit before calling filtering_service.filter_and_rank_jobs and
instead pass the original request.limit_per_company (validated/capped by
MAX_RESULT_LIMIT if needed) as a named per_company_limit argument to
filtering_service.filter_and_rank_jobs (or update filter_and_rank_jobs to accept
per_company_limit), so filtering logic (inside filter_and_rank_jobs) can limit
each company's entries individually before merging into the final result set;
keep MAX_RESULT_LIMIT enforcement only as validation of the requested
per-company value (use request.limit_per_company -> per_company_limit variable,
cap that at MAX_RESULT_LIMIT, log if capped, then pass per_company_limit into
filter_and_rank_jobs).
- Around line 208-213: The except Exception as e block in scrapper/api/routes.py
currently returns the raw exception to clients; instead, log the full error
details (use logger.exception or logger.error with exc_info=True) inside that
except block to preserve stacktrace, and raise HTTPException(status_code=500,
detail="Internal server error") (or another generic message) so clients do not
receive internal exception text; keep references to the existing logger and
HTTPException symbols and replace the current f"Job ingestion failed: {str(e)}"
detail with the generic message while retaining the detailed error in the logs.

In `@scrapper/companies.json`:
- Around line 2-198: The companies list in scrapper/companies.json contains
duplicate slugs (e.g., "stripe", "hashicorp", "datadog", "pinterest", "asana",
"notion", etc.) which will produce duplicate scrapes; remove duplicate entries
from the JSON source so each company slug appears only once, and also harden
load_companies() to dedupe on ingestion (e.g., convert the loaded array to a Set
or use Array.from(new Set(...))) to prevent future duplicates slipping in.

In `@scrapper/config/loader.py`:
- Around line 6-7: The imports in scrapper/config/loader.py are using top-level
module paths and will fail when scrapper.config is imported; update the import
statements that reference ConfigException and get_logger to use package-relative
imports (e.g., from ..utils.exceptions import ConfigException and from
..utils.logger import get_logger) or fully-qualified package paths
(scrapper.utils.exceptions.ConfigException, scrapper.utils.logger.get_logger) so
the symbols ConfigException and get_logger are resolved correctly when loader.py
is loaded.
- Around line 66-80: load_config currently converts env strings to int/float and
will raise raw ValueError on malformed values; wrap the conversions for
"HTTP_TIMEOUT", "MAX_RETRIES", "RETRY_BACKOFF_FACTOR", and "REQUESTS_PER_SECOND"
in a try/except that catches ValueError and raises ConfigException with a clear
message (include the env var name and original error) so startup fails with a
ConfigException instead of an unhandled ValueError; keep using getenv for
defaults and ensure the boolean DEBUG parsing remains unchanged.
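
A sketch of the numeric-env fix described above (ConfigException stands in for the project's own exception; env var names are from the finding):

import os

class ConfigException(Exception):
    pass

def _env_float(name: str, default: str) -> float:
    raw = os.getenv(name, default)
    try:
        return float(raw)
    except ValueError as e:
        # Fail startup with a named configuration error instead of a raw ValueError
        raise ConfigException(f"Invalid value for {name}: {raw!r} ({e})") from e

http_timeout = _env_float("HTTP_TIMEOUT", "10")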

In `@scrapper/main.py`:
- Around line 51-58: The CORS config currently uses app.add_middleware with
CORSMiddleware and allow_origins=["*"] plus allow_credentials=True; change this
to read allowed origins from configuration/env (e.g., a FRONTEND_ORIGINS or
CORS_ALLOWED_ORIGINS setting) and set allow_origins to that enumerated list,
ensuring allow_credentials remains true only when origins are explicit; update
the code path that constructs the CORSMiddleware (the app.add_middleware call)
to parse a comma-separated env var or config list and use it instead of ["*"],
and add fallback to a safe default (empty list or explicit localhost) for
non-production.
- Around line 79-86: The global_exception_handler currently returns the
exception text to clients; change it to return only a generic error payload
(e.g., {"error":"Internal server error"}) and stop including str(exc) in the
JSONResponse, and instead log the full traceback server-side using
logger.exception(...) or traceback.format_exc() before returning the response so
the details are captured in logs but not echoed to clients; update the
global_exception_handler function accordingly.
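
A combined sketch of both main.py fixes, assuming a comma-separated CORS_ALLOWED_ORIGINS env var (the variable name follows the finding's suggestion, not confirmed config):

import logging
import os

from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)
app = FastAPI()

# Enumerate origins instead of "*"; fall back to localhost outside production
origins = [
    o.strip()
    for o in os.getenv("CORS_ALLOWED_ORIGINS", "http://localhost:3000").split(",")
    if o.strip()
]
app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,  # acceptable only because origins are explicit
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    # Full traceback stays in server logs; clients get a generic payload
    logger.exception("Unhandled error on %s", request.url.path)
    return JSONResponse(status_code=500, content={"error": "Internal server error"})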

In `@scrapper/models/job_schema.py`:
- Around line 74-77: The docs/examples disagree with the declared max for
limit_per_company (description says max 12 but examples send 50) — fix by
enforcing the contract and keeping docs in sync: change the type of
limit_per_company to use a constrained int (e.g., conint(le=12) or a pydantic
validator in the Job schema) so values above 12 raise validation errors, and
update the Field description/examples mentioned at lines ~88-89 to reflect the
same max (or vice-versa if intended max is 50 — then update the Field
description to "max: 50" consistently). Reference: limit_per_company (and the
other similar field noted at 88-89) so both validation and description/examples
match.
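
A minimal sketch with the bound enforced at the model level (Pydantic Field constraints; whether the true max is 12 or 50 still needs the author's call):

from typing import Optional
from pydantic import BaseModel, Field

class IngestionRequest(BaseModel):
    # le=12 makes out-of-range requests fail validation instead of silently passing
    limit_per_company: Optional[int] = Field(default=None, ge=1, le=12)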

In `@scrapper/service/job_filter.py`:
- Around line 138-141: The current truthy check treats limit=0 as False and
returns all jobs; change the condition in the block that assigns top_jobs (using
variables sorted_jobs, top_jobs, limit) to explicitly check for None (e.g., if
limit is not None) so that an integer 0 correctly results in an empty list slice
(top_jobs = sorted_jobs[:limit]) and does not fall back to returning the full
sorted_jobs; optionally validate that limit is an int >= 0 before slicing to
avoid negative or non-integer behavior.

In `@scrapper/service/scoring.py`:
- Around line 129-131: The current substring checks using keywords and
text_lower are too permissive; replace them with whole-word matching (e.g., use
regex word-boundaries or tokenization) so you only add a keyword to matched when
it appears as a distinct token. Concretely: for the loops that iterate over
keywords and check "if keyword in text_lower" (the matched set update using
variables keywords, text_lower, matched), change to either compile a
case-insensitive pattern using word boundaries around re.escape(keyword) and
test with re.search, or tokenize text_lower into words and check membership;
also treat very short keywords (<=2 chars) more strictly (require boundaries or
skip/validate) to avoid false positives. Apply the same change to the other
occurrences referenced (the checks at the other two locations).
- Around line 327-330: The "flexible remote" branch uses integer floor division
so remote_match_weight=1 yields zero; in the elif that checks user_remote_only
and job_remote (the block updating score and
job_score.breakdown["remote_flexible"]), change the calculation to ensure a
non-zero bonus for odd weights — e.g., use a proper half with rounding
(math.ceil(config.remote_match_weight / 2) or (config.remote_match_weight + 1)
// 2) or compute as float then cast — and apply that value both to score and
job_score.breakdown["remote_flexible"] instead of config.remote_match_weight //
2.
- Around line 271-273: Guard against None before constructing sets: replace the
direct conversions in scoring.py so user_skills and user_roles use a safe
default (e.g. user_skills = set(user_context.get("skills") or []) and user_roles
= set(user_context.get("preferred_roles") or [])); keep role_keywords logic
(role_keywords = user_roles if user_roles else config.role_keywords) so that an
empty or None input falls back to config.role_keywords.
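
One way the three scoring fixes above could look together; the helper names are assumptions for illustration:

import math
import re

def match_keywords(keywords, text_lower):
    """Whole-word matching; short keywords also require word boundaries."""
    matched = set()
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, text_lower):
            matched.add(keyword)
    return matched

def flexible_remote_bonus(remote_match_weight: int) -> int:
    # Ceiling half-weight: a weight of 1 still yields a non-zero bonus of 1
    return math.ceil(remote_match_weight / 2)

def extract_user_sets(user_context):
    # "or []" guards against explicit None values in the context dict
    user_skills = set(user_context.get("skills") or [])
    user_roles = set(user_context.get("preferred_roles") or [])
    return user_skills, user_roles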

In `@scrapper/sources/greenhouse.py`:
- Around line 43-46: The constructor currently creates a new HttpClient per
instance (in GreenhouseSource.__init__ via self.http_client = HttpClient()),
which can leak connection pools; change the constructor to accept a HttpClient
instance (e.g., http_client) and assign self.http_client = http_client (falling
back to the lifecycle-managed/shared client if none provided) so callers reuse
the shared, lifecycle-managed HttpClient instead of allocating a new one per
GreenhouseSource.
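
A sketch of the injected-client constructor (HttpClient and the module-level shared client are stand-ins for the project's own lifecycle-managed instance):

class HttpClient:  # stand-in for scrapper/utils/http_client.py
    ...

_shared_client = HttpClient()

class GreenhouseSource:
    def __init__(self, http_client: HttpClient | None = None):
        # Reuse the shared client unless a caller injects one, so each
        # source instance no longer allocates its own connection pool
        self.http_client = http_client or _shared_client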

In `@scrapper/sources/README.md`:
- Around line 52-54: The README example raises SourceException with a single
positional argument but the SourceException constructor requires both source and
message; update the except block to call SourceException with both parameters
(e.g., SourceException(source="Workable", message=f"Failed to fetch from
Workable: {e}") or SourceException("Workable", f"Failed to fetch from Workable:
{e}")) so using the SourceException class signature matches its implementation.

In `@scrapper/tests/test_api.py`:
- Around line 34-71: The tests in scrapper/tests/test_api.py
(test_ingest_endpoint_no_request_body,
test_ingest_endpoint_response_schema_structure, and related ingestion tests) are
non-deterministic because they hit live downstream fetchers; replace that by
stubbing/mocking the downstream client/source used by the ingest endpoint (the
module/class/function the route calls to fetch companies/jobs) via the test
client/pytest monkeypatch or a fixture so the endpoint receives a deterministic
payload (e.g., a known companies/jobs response) and always returns 200; then
tighten assertions to require a 200 and validate the exact expected JSON schema
(presence and types for "total" and "jobs") instead of allowing transient
429/408/504—keep test_ingest_endpoint_invalid_source asserting 400 for unknown
sources.

In `@scrapper/tests/test_filtering.py`:
- Around line 233-244: The test test_score_combined is too permissive — replace
the loose assertion "assert scored.score >= 10" with an exact equality check to
catch weighting regressions (e.g. assert scored.score == 10); update the
assertion in scrapper/tests/test_filtering.py where score_job(test_job,
user_context) is invoked and the resulting scored.score is checked so the test
validates the precise combined total from score_job rather than allowing
accidental extra points.

In `@scrapper/utils/http_client.py`:
- Around line 95-97: The rate limiter is not concurrency-safe because
last_request_time is mutated without synchronization; add an asyncio.Lock (e.g.
self._rate_lock) in the HttpClient initializer and wrap the logic inside
_enforce_rate_limit with that lock: acquire the lock, compute now and required
wait = interval - (now - self.last_request_time), if wait > 0 await
asyncio.sleep(wait), then update self.last_request_time = time.monotonic() (or
now + wait) and release the lock so only one coroutine updates the timestamp at
a time; apply the same locking change wherever _enforce_rate_limit is called
(and ensure the lock is created and used consistently in the HttpClient class).
- Around line 45-46: The constructor currently computes
self.min_request_interval = 1.0 / requests_per_second which raises
ZeroDivisionError for requests_per_second <= 0; update the HttpClient __init__
to validate requests_per_second > 0 up front (e.g., if requests_per_second <= 0:
raise ValueError("requests_per_second must be > 0")) before computing
self.min_request_interval and setting self.last_request_time so initialization
fails fast with a clear error message.

In `@scrapper/utils/logger.py`:
- Around line 13-36: The JsonFormatter in scrapper/utils/logger.py currently
only serializes source/company/status/duration_ms/job_count and is dropping the
additional metadata emitted by log_ingest_start(), log_source_fetch(), and
log_ingest_complete() (ingest counters and error payloads); update JsonFormatter
to check for and include the ingest-related fields (e.g., any record attributes
like ingest_counters, ingest_stats, ingest_count, ingest_errors, error or
error_payload) and serialize them into log_data when present, or alternatively
adjust the helper functions
(log_ingest_start/log_source_fetch/log_ingest_complete) to use the existing
field names the formatter expects so the structured metadata is preserved.
Ensure you reference JsonFormatter in scrapper/utils/logger.py and the helper
functions log_ingest_start, log_source_fetch, and log_ingest_complete when
making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bf8965c8-ac6f-41f5-8116-cd19f1f98d04

📥 Commits

Reviewing files that changed from the base of the PR and between aa036df and ab3d1b0.

📒 Files selected for processing (31)
  • IMPLEMENTATION_SUMMARY.md
  • scrapper/.gitignore
  • scrapper/README.md
  • scrapper/__init__.py
  • scrapper/api/__init__.py
  • scrapper/api/routes.py
  • scrapper/companies.json
  • scrapper/config/__init__.py
  • scrapper/config/loader.py
  • scrapper/conftest.py
  • scrapper/main.py
  • scrapper/models/__init__.py
  • scrapper/models/job_schema.py
  • scrapper/requirements.txt
  • scrapper/service/__init__.py
  • scrapper/service/job_filter.py
  • scrapper/service/scoring.py
  • scrapper/sources/README.md
  • scrapper/sources/__init__.py
  • scrapper/sources/base.py
  • scrapper/sources/greenhouse.py
  • scrapper/tests/__init__.py
  • scrapper/tests/conftest.py
  • scrapper/tests/test_api.py
  • scrapper/tests/test_filtering.py
  • scrapper/tests/test_greenhouse.py
  • scrapper/tests/test_sources.py
  • scrapper/utils/__init__.py
  • scrapper/utils/exceptions.py
  • scrapper/utils/http_client.py
  • scrapper/utils/logger.py

Comment thread IMPLEMENTATION_SUMMARY.md
Comment on lines +6 to +7
**Test Coverage**: 46/46 tests PASSING
**Date**: April 2026

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test-count section is internally inconsistent.

The headline says 46/46, but the listed file-level totals exceed that. Please reconcile these numbers so the summary remains reliable.

Also applies to: 1037-1042


Comment thread scrapper/api/routes.py
Comment on lines +99 to +100
errors: List[str] = []


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ingestion can return success even when all source fetches fail.

errors are collected, but never influence response status. If all companies fail, this still returns 200 with empty jobs.

Suggested fix
         logger.info(
             "Fetch and normalization completed",
             extra={
                 "total_jobs_fetched": len(all_jobs),
                 "fetch_duration_ms": fetch_duration_ms
             }
         )
+
+        if errors and not all_jobs:
+            raise HTTPException(
+                status_code=502,
+                detail="All source fetches failed"
+            )

Also applies to: 127-143, 201-204


Comment thread scrapper/api/routes.py
Comment on lines +165 to +179
result_limit = request.limit_per_company
if result_limit is None:
    result_limit = DEFAULT_RESULT_LIMIT
elif result_limit > MAX_RESULT_LIMIT:
    result_limit = MAX_RESULT_LIMIT
    logger.info(
        f"Result limit capped at {MAX_RESULT_LIMIT} (requested: {request.limit_per_company})",
        extra={"requested": request.limit_per_company, "capped_at": MAX_RESULT_LIMIT}
    )

filter_result = filtering_service.filter_and_rank_jobs(
    all_jobs,
    user_context=user_context_dict,
    limit=result_limit
)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

limit_per_company is applied as a global cap, not per company.

Current behavior limits final merged results, so requests spanning multiple companies can return far fewer jobs than implied by the field name/contract.
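
A sketch of the per-company interpretation (function and field names are assumptions):

from collections import defaultdict

def apply_per_company_limit(ranked_jobs, per_company_limit):
    """Keep at most per_company_limit jobs per company before merging."""
    taken = defaultdict(int)
    merged = []
    for job in ranked_jobs:  # assumes jobs arrive ranked best-first
        company = job["company"]
        if taken[company] < per_company_limit:
            taken[company] += 1
            merged.append(job)
    return merged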

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 171-171: Logging statement uses f-string

(G004)


Comment thread scrapper/api/routes.py
Comment on lines +208 to +213
except Exception as e:
    logger.error(
        f"Job ingestion failed: {str(e)}",
        extra={"error": str(e)}
    )
    raise HTTPException(status_code=500, detail=f"Job ingestion failed: {str(e)}")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Raw exception text is returned to clients in 500 responses.

This leaks internal failure details and can expose sensitive internals. Return a generic message and keep specifics in logs.

Suggested fix
-    except Exception as e:
-        logger.error(
-            f"Job ingestion failed: {str(e)}",
-            extra={"error": str(e)}
-        )
-        raise HTTPException(status_code=500, detail=f"Job ingestion failed: {str(e)}")  
+    except Exception as e:
+        logger.exception("Job ingestion failed", extra={"error": str(e)})
+        raise HTTPException(status_code=500, detail="Job ingestion failed")
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 208-208: Do not catch blind exception: Exception

(BLE001)


[warning] 209-212: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


[warning] 210-210: Logging statement uses f-string

(G004)


[warning] 210-210: Use explicit conversion flag

Replace with conversion flag

(RUF010)


[warning] 213-213: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 213-213: Use explicit conversion flag

Replace with conversion flag

(RUF010)


Comment thread scrapper/companies.json
Comment on lines +2 to +198
"greenhouse": [
"stripe",
"notion",
"figma",
"airbnb",
"robinhood",
"coinbase",
"discord",
"dropbox",
"instacart",
"databricks",
"scaleai",
"brex",
"gusto",
"rippling",
"benchling",
"plaid",
"asana",
"intercom",
"zapier",
"segment",
"cloudflare",
"hashicorp",
"snowflake",
"datadog",
"mongodb",
"elastic",
"fastly",
"canva",
"wise",
"revolut",
"klarna",
"n26",
"razorpay",
"cred",
"meesho",
"groww",
"zerodha",
"druva",
"digicert",
"stabilityai",
"freshworks",
"chargebee",
"browserstack",
"postman",
"inmobi",
"unacademy",
"sharechat",
"spinny",
"urbancompany",
"github",
"gitlab",
"slack",
"twilio",
"stripe",
"square",
"shopify",
"hashicorp",
"terraform",
"datadog",
"newrelic",
"splunk",
"salesforce",
"hubspot",
"zendesk",
"okta",
"auth0",
"twitch",
"reddit",
"pinterest",
"medium",
"substack",
"patreon",
"kickstarter",
"indiegogo",
"pebble",
"fitbit",
"garmin",
"sonos",
"oculus",
"htc",
"samsung",
"apple",
"google",
"microsoft",
"amazon",
"meta",
"netflix",
"disney",
"hulu",
"paramount",
"peacock",
"cbs",
"hbo",
"showtime",
"starz",
"apple-tv",
"youtube",
"twitch",
"dailymotion",
"vimeo",
"flickr",
"imgur",
"giphy",
"tenor",
"pinterest",
"tumblr",
"wix",
"squarespace",
"weebly",
"godaddy",
"bluehost",
"hostgator",
"namecheap",
"domain-com",
"aws",
"azure",
"gcp",
"digitalocean",
"heroku",
"vercel",
"netlify",
"render",
"fly-io",
"railway",
"dokku",
"linode",
"vultr",
"lightsail",
"rackspace",
"openstack",
"kubernetes",
"docker",
"jenkins",
"gitlab-ci",
"github-actions",
"circleci",
"travis-ci",
"appveyor",
"buildkite",
"codefresh",
"drone",
"harness",
"atlassian",
"jira",
"confluence",
"bitbucket",
"trello",
"asana",
"monday",
"notion",
"clickup",
"meistertask",
"wrike",
"smartsheet",
"airtable",
"typeform",
"typebot",
"jotform",
"formstack",
"wufoo",
"surveysparrow",
"qualtrics",
"alchemer",
"calendly",
"acuityscheduling",
"vcita",
"booksy",
"mindbody",
"maroochy",
"zoho",
"pipedrive",
"copper",
"freshsales",
"agilecrm",
"insightly",
"zohocrm",
"dynamic365",
"salesforcecrm",
"mailchimp",
"constantcontact",
"convertkit",
"activecampaign",
"klaviyo",
"braze",
"iterable",
"customer-io",
"amplitude",
"mixpanel",
"heap",
"fullstory",
"logrocket",
"sentry",
"rollbar",
"bugsnag",
"appinsights"
]

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the repeated company slugs.

Because load_companies() preserves this list as-is, the repeated entries here will be scraped multiple times and can produce duplicate jobs plus extra network traffic. If the duplicates are intentional, please document the weighting explicitly; otherwise dedupe the source list before ingestion.
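
In Python, the loader could dedupe defensively along these lines (the real load_companies() signature may differ):

import json

def load_companies(path="scrapper/companies.json"):
    """Load source-to-company mappings, deduplicating while keeping order."""
    with open(path) as f:
        data = json.load(f)
    # dict.fromkeys preserves first-seen order, unlike set()
    return {source: list(dict.fromkeys(slugs)) for source, slugs in data.items()}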


Comment thread scrapper/tests/test_api.py
Comment on lines +34 to +71
def test_ingest_endpoint_no_request_body(client):
    """Test ingest endpoint with no request body - should work with live companies.json."""
    # This test uses the real companies.json file
    response = client.post("/internal/ingest")

    # Should succeed (200) even if it gets 0 jobs
    # OR might get rate limited (429) or timeout, but not 500
    assert response.status_code in [200, 429, 408, 504]
    if response.status_code == 200:
        data = response.json()
        assert "total" in data
        assert "jobs" in data


def test_ingest_endpoint_invalid_source(client):
    """Test ingest with invalid source."""
    response = client.post(
        "/internal/ingest",
        json={"sources": ["invalid_source"]}
    )

    assert response.status_code == 400
    assert "Unknown source" in response.json()["detail"]


def test_ingest_endpoint_response_schema_structure(client):
    """Test that response structure is correct."""
    response = client.post("/internal/ingest", json={"companies": ["stripe"]})

    # Either success or expected error (not 500)
    if response.status_code == 200:
        data = response.json()

        # Check response schema
        assert "total" in data
        assert "jobs" in data
        assert isinstance(data["total"], int)
        assert isinstance(data["jobs"], list)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the ingest tests deterministic.

Both ingest checks still depend on live downstream fetching and allow transient failures to count as success, so CI can go green even when /internal/ingest is timing out or returning no usable payload. Stub the downstream client/source here and fail unless the endpoint returns the expected 200 payload and schema.
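
One way to stub the source with pytest's monkeypatch (the patch target path is an assumption based on the file list above):

import pytest

@pytest.fixture
def stub_greenhouse(monkeypatch):
    async def fake_fetch_jobs(self, company, **kwargs):
        # Deterministic payload; no network access
        return [{"title": "Backend Engineer", "company": company}]

    monkeypatch.setattr(
        "scrapper.sources.greenhouse.GreenhouseSource.fetch_jobs",
        fake_fetch_jobs,
    )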


Comment thread scrapper/tests/test_filtering.py
Comment on lines +233 to +244
def test_score_combined(self, test_job):
    """Test combined scoring."""
    user_context = {
        "preferred_roles": ["backend"],
        "skills": ["python", "go"],
        "preferred_location": "San Francisco",
        "remote_only": True
    }
    scored = score_job(test_job, user_context)

    # Should have: +3 (title) + +2 (desc) + +3 (skills) + +1 (location) + +1 (remote) = 10
    assert scored.score >= 10

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the exact combined score here.

>= 10 lets accidental bonus points or double-counting slip through without failing the test, so this won't catch regressions in the scoring weights.

✅ Suggested fix
-        assert scored.score >= 10
+        assert scored.score == 10

Comment thread scrapper/utils/http_client.py
Comment on lines +45 to +46
self.min_request_interval = 1.0 / requests_per_second
self.last_request_time = None

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate requests_per_second before interval math

requests_per_second <= 0 currently crashes with a ZeroDivisionError during initialization instead of a clear config error. Fail fast with an explicit validation exception.

Suggested fix
         self.retry_backoff_factor = retry_backoff_factor
         self.requests_per_second = requests_per_second
+        if self.requests_per_second <= 0:
+            raise ValueError("requests_per_second must be > 0")
         
         # Rate limiting
         self.min_request_interval = 1.0 / requests_per_second

Comment thread scrapper/utils/http_client.py
Comment on lines +95 to +97
# Rate limiting: enforce minimum interval between requests
await self._enforce_rate_limit()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rate limiter is not concurrency-safe

The current rate-limit state (last_request_time) is mutated without a lock, so concurrent requests can pass the check together and exceed the configured RPS. This breaks throttling guarantees under load.

Suggested fix
 class HttpClient:
@@
-        self.last_request_time = None
+        self.last_request_time = None
+        self._rate_limit_lock = asyncio.Lock()
@@
     async def _enforce_rate_limit(self):
         """Enforce rate limiting by enforcing minimum interval between requests."""
-        if self.last_request_time is not None:
-            elapsed = (datetime.now() - self.last_request_time).total_seconds()
-            if elapsed < self.min_request_interval:
-                await asyncio.sleep(self.min_request_interval - elapsed)
-        
-        self.last_request_time = datetime.now()
+        async with self._rate_limit_lock:
+            if self.last_request_time is not None:
+                elapsed = (datetime.now() - self.last_request_time).total_seconds()
+                if elapsed < self.min_request_interval:
+                    await asyncio.sleep(self.min_request_interval - elapsed)
+            self.last_request_time = datetime.now()

Also applies to: 148-155


Comment thread scrapper/utils/logger.py
Comment on lines +13 to +36
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }

        # Add extra fields if present
        if hasattr(record, "source"):
            log_data["source"] = record.source
        if hasattr(record, "company"):
            log_data["company"] = record.company
        if hasattr(record, "status"):
            log_data["status"] = record.status
        if hasattr(record, "duration_ms"):
            log_data["duration_ms"] = record.duration_ms
        if hasattr(record, "job_count"):
            log_data["job_count"] = record.job_count

        # Add exception info if present
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)

        return json.dumps(log_data)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the structured fields emitted by the helpers.

JsonFormatter only serializes source/company/status/duration_ms/job_count, so the new ingest counters and error payloads added by log_ingest_start(), log_source_fetch(), and log_ingest_complete() are silently dropped. Expand the formatter or make the helper payloads match the formatter so the JSON logs actually contain the metadata they advertise.

🔧 Suggested fix
-        # Add extra fields if present
-        if hasattr(record, "source"):
-            log_data["source"] = record.source
-        if hasattr(record, "company"):
-            log_data["company"] = record.company
-        if hasattr(record, "status"):
-            log_data["status"] = record.status
-        if hasattr(record, "duration_ms"):
-            log_data["duration_ms"] = record.duration_ms
-        if hasattr(record, "job_count"):
-            log_data["job_count"] = record.job_count
+        for field in (
+            "source",
+            "company",
+            "status",
+            "duration_ms",
+            "job_count",
+            "sources",
+            "companies",
+            "total_jobs",
+            "errors",
+            "error",
+        ):
+            if hasattr(record, field):
+                log_data[field] = getattr(record, field)

Also applies to: 62-112

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 14-14: datetime.datetime.utcnow() used

(DTZ003)

