System Architecture
The google-play-scraper is designed to be fast, reliable, and strictly typed. Rather than relying on fragile XPath/CSS selectors over the rendered HTML, it primarily extracts state from JSON arrays embedded in the page and reverse-engineers Google's internal RPC requests.
Data Extraction Strategy
Google Play pages are rendered server-side, with large, deeply nested JSON arrays injected into <script> tags via calls to a function named AF_initDataCallback.
JSON Array Traversal (ds:x)
When you call a function like get_app_details(), the scraper:
1. Downloads the initial HTML using httpx.
2. Uses regex to find the JavaScript variable assignments or callbacks containing the required data block (usually ds:5 or ds:4).
3. Parses the extracted JSON string into a nested Python list.
4. Uses carefully mapped indices to retrieve specific metadata.
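Steps 1–3 can be sketched as follows. To keep the example self-contained, an inline sample page stands in for the httpx download; the sample markup and the regex are illustrative, though real pages embed far larger arrays in the same AF_initDataCallback shape:

```python
import json
import re

# Illustrative stand-in for the downloaded page; real ds:N payloads are
# far larger but follow the same callback structure.
SAMPLE_HTML = (
    "<script>AF_initDataCallback({key: 'ds:5', hash: '7', "
    'data:[["Example App", [[4.5]]]], sideChannel: {}});</script>'
)

def extract_ds_block(html: str, key: str) -> list:
    """Find the AF_initDataCallback block for `key` and parse its data array."""
    pattern = r"AF_initDataCallback\({key: '%s'.*?data:(\[.*?\]), sideChannel" % re.escape(key)
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        raise ValueError(f"no {key} block found in page")
    return json.loads(match.group(1))

data = extract_ds_block(SAMPLE_HTML, "ds:5")
```

In the real scraper the html argument would come from an httpx GET of the app's Play Store URL; everything downstream of that call works the same way.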
Example index mapping:
Developer Name is often found at data[1][2][68][0].
Because Google can update these indices at any time, accessing them directly can raise IndexError or TypeError. We mitigate this in the _get_val() and _map_to_app_details() functions using structural pattern matching and targeted exception handling.
Internal RPCs (batchexecute)
For paginated data (such as infinitely scrolling reviews) or secondary data views (such as Permissions and Data Safety), Google relies on a single, heavily obfuscated POST endpoint:
https://play.google.com/_/PlayStoreUi/data/batchexecute
The scraper interacts with this endpoint via the internal _batch_execute() method.
How it works:
1. We send a POST request containing a specially formatted, URL-encoded payload array.
2. This array typically requires a specific "RPC ID" (e.g., UsvDTd for user reviews).
3. The response is a JSON envelope that contains a stringified JSON array inside one of its elements.
4. We must slice the protective anti-hijacking prefix off the response, decode the outer JSON envelope, extract the inner string, and decode that as a second JSON payload to reach the actual multidimensional data array.
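The double-decode in steps 3–4 can be sketched against a hand-built response body. The )]}' guard prefix and the "wrb.fr" frame layout reflect commonly observed batchexecute responses, but the inner payload here is purely illustrative:

```python
import json

# Hand-built stand-in for a batchexecute response body; real inner payloads
# are much larger, but the envelope shape is the same.
RAW_RESPONSE = (
    ")]}'\n\n"
    '[[["wrb.fr","UsvDTd","[[\\"Great app\\", 5]]",null,null,null,"generic"]]]'
)

def decode_batchexecute(body: str, rpc_id: str) -> list:
    """Strip the protective prefix, then decode both JSON layers."""
    envelope = json.loads(body[body.index("\n"):])  # drop the )]}' guard line
    for frame in envelope[0]:
        # Each frame looks like ["wrb.fr", <rpc id>, <stringified JSON>, ...].
        if frame[0] == "wrb.fr" and frame[1] == rpc_id:
            return json.loads(frame[2])  # second decode: the inner payload
    raise ValueError(f"RPC {rpc_id} not found in response")

reviews = decode_batchexecute(RAW_RESPONSE, "UsvDTd")
```

In the real flow, body would be the text of the POST response from _batch_execute(); matching on the RPC ID lets one response carry frames for several RPCs at once.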
Core Components
PlayScraper (scraper.py)
This is the main interaction point. It manages the underlying httpx.Client lifecycle via a context manager so that HTTP connections are kept alive (Keep-Alive) and reused across multiple requests.
Data Models (models.py)
Because the extracted data from Google Play lists can be unpredictable (e.g., missing fields, null values, or string/number type confusion), Pydantic is heavily utilized.
Instead of passing untyped dictionaries around, methods like get_app_details() map the raw nested lists directly into structured Pydantic models (such as AppDetails, Review, and DataSafety).
This guarantees that consumers of the library always receive predictable, strictly typed data: parsing failures surface at the lowest level (the extraction phase) instead of leaking into the user's application logic.
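As a sketch of that mapping step (the field names and index paths here are illustrative, not the library's real schema):

```python
from typing import Optional

from pydantic import BaseModel

class AppDetails(BaseModel):
    """Illustrative subset of an app-details model."""
    title: str
    developer: Optional[str] = None
    score: Optional[float] = None

def map_app_details(raw: list) -> AppDetails:
    # Index paths are made up for this example; the real ones live in the
    # scraper's mapping layer. Pydantic validates and coerces each field.
    return AppDetails(title=raw[0][0], developer=raw[1][0], score=raw[2][0])

details = map_app_details([["Example App"], ["Example Dev"], [4.5]])
```

If Google serves a score as the string "4.5", Pydantic coerces it to a float; if a required field like title is missing entirely, validation fails right here, at the extraction boundary.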