In July 2021, I returned to Persona as a full-time member of the team after working with them through two internships. I resonated strongly with the team's values and was drawn to their authenticity, humility, and kindness. Before my first internship, they even sent me a personalized welcome video. Honestly super wholesome <3.
Another reason I decided to return was that I knew, based on my internship experience, that I'd get a high level of responsibility and ownership as a full-time employee. In my last internship, I was assigned as the directly responsible individual (DRI) of engineering for a large project and even got to work with the project manager to drive product decisions. Throughout the process, I received mentorship and guidance primarily in the form of feedback on proposals, design documents, and other written materials.
One of the core components of Persona's platform is automated identity verification (IDV). My task was to develop a system — called the Government ID Registry — that can detect government IDs that were unexpectedly submitted repeatedly and IDs that appeared to be manipulated versions of previously submitted IDs. For context, the former case may be useful to prevent users from creating multiple accounts to take advantage of promotions. The latter can be taken as a strong indicator that a government ID is fraudulent or a victim of manipulation of some sort.
In addition to effectively detecting such IDs, the design and implementation also had to:
- Allow curation of the possible set of matching IDs
- Minimize additional latency on our verification flow's processing times
- Be scalable with customer volume
Designing a high-level solution
Before figuring out the technical details, we had to determine how the technology should be modeled. We decided that the Government ID Registry should represent multiple databases of government IDs that can be maintained separately for different purposes. This approach enables the set of possible matching IDs to be curated if necessary and also naturally creates logical separation. We would be able to maintain databases for different customers and different use cases if necessary.
Following its modeling as a database, a registry can be queried for similar government IDs given a single government ID. Queries can search for government IDs with similar textual personally identifiable information (PII), face, both, or just one of the two. The granularity of these queries enables us to identify the very IDs that we are trying to detect with this project. Furthermore, government IDs can be both added and removed from the registry as one would expect of a database.
The next step was to figure out how to implement the queries described above. This was challenging because Persona's IDV platform is fully automated. It's easy for humans to check if the details or face on two IDs are similar, but fully automated IDV solutions rely on optical character recognition (OCR) technology, which is not perfect, to say the least. OCR inaccuracies are inevitable, so our design needed to account for them to minimize missed detections. By defining and employing searches on multiple reasonably unique identifiers, which are just strategic subsets of the attributes on a government ID, we patched up many of the holes caused by OCR imperfections. Note that we consider an identifier reasonably unique if, in theory, it should legitimately identify at most a few individuals in the global population.
With an approach to reliably search for ID similarities, we defined the criteria by which an ID is classified as a repeat ID or as a potentially manipulated ID (i.e. an inconsistent repeat ID):
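As a rough illustration of the idea (not Persona's actual implementation), the sketch below builds several search keys from different subsets of the OCR output. The attribute names, the two identifier definitions, and the normalization rule are all assumptions made for this example.

```python
# Hypothetical identifier definitions: each one is a subset of ID attributes
# that should, in theory, point at very few people. These are illustrative,
# not the identifiers used in the real system.
IDENTIFIERS = {
    "id_number": ("id_class", "issuing_country", "id_number"),
    "name_dob": ("id_class", "first_name", "last_name", "birthdate"),
}


def normalize(value: str) -> str:
    """Uppercase and drop punctuation/whitespace so formatting noise is ignored."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def identifier_keys(ocr_fields: dict) -> dict:
    """Build one search key per identifier whose attributes were all extracted."""
    keys = {}
    for name, attrs in IDENTIFIERS.items():
        values = [ocr_fields.get(attr) for attr in attrs]
        if all(values):  # skip identifiers with missing or empty attributes
            keys[name] = "|".join(normalize(v) for v in values)
    return keys
```

With keys like these, an OCR misread in the ID number still leaves the name-and-birthdate key intact, so the submission can be found through at least one identifier.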
- A government ID that is similar in both face portrait and at least one of the other textual identifiers is considered a repeat ID.
- A government ID that is similar in only one of the two, either the face portrait or some set of the textual identifiers, is considered an inconsistent repeat ID.
The theory behind the latter criterion is that since we consider each identifier as identifying a unique individual, each instance of one type of identifier should map to exactly one of each of the other types of identifiers. Thus, it would be unexpected if a single instance of a textual identifier maps to more than a single face. The exception is IDs that are of a different class (e.g. a driver's license versus a passport), but this exception is trivially handled.
We went through several iterations before reaching this high-level design. Special thanks to my team for being so invested in my proposals and helping reach a solution design that is clean and aligns very closely with how we would communicate the technology to our customers. This particular iterative process was probably one of the main ways I grew throughout the internship.
Querying similar IDs by textual PII
To achieve fast, scalable, and reliable querying of government IDs based on textual PII identifiers, we needed to build a robust search database.
This part of my internship required a lot of experimentation and research because it was a new type of problem within Persona and there were no in-house experts on the subject. This was a unique experience for me, but it was really enjoyable to be able to lay the foundation for something.
We tested several database index designs and benchmarked each one for both performance and efficacy. There was an additional requirement of supporting fuzzy matching, which was a large part of the benchmarking process for efficacy. After testing, we selected an index design with adequate performance. To boost the efficacy of fuzzy matching, however, we implemented additional fine-grained server-side filtering on top of the candidates returned from a query to our search database.
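The two-stage pattern described here (a broad query against the search database, then a stricter filter on the server) could look something like the sketch below. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the field names, the 0.85 threshold, and the "every field must pass" rule are assumptions for illustration, not the filter that was actually shipped.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def filter_candidates(query: dict, candidates: list, fields: list,
                      threshold: float = 0.85) -> list:
    """Keep only candidates whose compared fields are all close to the query.

    `candidates` is assumed to be the loose result set returned by the
    search database; this pass only tightens it.
    """
    kept = []
    for cand in candidates:
        scores = [similarity(query[f], cand[f]) for f in fields]
        if min(scores) >= threshold:
            kept.append(cand)
    return kept
```

Keeping the search index permissive and applying the stricter comparison afterwards means the threshold can be tuned without reindexing.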
Querying similar IDs by face
The next major component was the ability to query for government IDs that shared a similar face with the submitted ID. In isolation, this component was quite straightforward because we could leverage our existing biometric models used for IDV as part of the implementation.
Putting the two together
With these two components together, the Government ID Registry was able to detect IDs of interest: repeated government IDs and inconsistent repeat government IDs. From a high level, repeated government IDs were IDs present in the result sets of both queries, and inconsistent repeat IDs were IDs that belonged to only one of the two result sets.
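In set terms, this is an intersection and a symmetric difference. A minimal sketch, assuming each query returns a set of matching ID references (the function and key names are made up for the example, and the ID-class exception mentioned earlier is ignored):

```python
def classify(text_matches: set, face_matches: set) -> dict:
    """Split matches into repeats and inconsistent repeats.

    repeat: matched by both the textual query and the face query.
    inconsistent_repeat: matched by exactly one of the two queries.
    """
    return {
        "repeat": text_matches & face_matches,
        "inconsistent_repeat": text_matches ^ face_matches,
    }


result = classify({"id_a", "id_b"}, {"id_b", "id_c"})
# "id_b" appears in both sets; "id_a" and "id_c" appear in only one.
```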
Challenges with managing registries
Figuring out how to implement the ability to add and remove IDs from a registry was surprisingly complex. To support both text matching and face matching capabilities, we referenced two disconnected data stores under the hood: our search database for the textual attributes of the ID and a separate face biometrics database for the associated faces.
While the collaboration of these two data stores enabled the capabilities of the Government ID Registry, certain complications surfaced as a result. Synchronization between the disconnected data stores was a significant challenge. If the two data stores fall out of sync, it may be possible to produce a match on an ID that should have been removed from the registry. An even worse scenario is if an ID is only removed from one of the data stores and returned by a query to the other. In this scenario, the ID may be flagged as an inconsistent repeat when in actuality it is a regular repeat or nothing in particular at all.
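One common way to limit the damage from a partial removal is to keep a single source of truth for registry membership and filter every raw hit against it. The sketch below illustrates that general idea with in-memory stand-ins for the two stores. It is an assumption for illustration only, not the synchronization method used at Persona, and the `fail_face_delete` flag exists purely to simulate a failed delete.

```python
class Registry:
    """Toy registry backed by two independent stores plus a membership set."""

    def __init__(self):
        self.text_store = {}  # stand-in for the search database
        self.face_store = {}  # stand-in for the face biometrics database
        self.members = set()  # single source of truth for membership

    def add(self, gov_id, text, face):
        self.text_store[gov_id] = text
        self.face_store[gov_id] = face
        # The ID becomes visible only after both writes have succeeded.
        self.members.add(gov_id)

    def remove(self, gov_id, fail_face_delete=False):
        # Hide the ID first, so a failed delete below cannot surface it.
        self.members.discard(gov_id)
        self.text_store.pop(gov_id, None)
        if not fail_face_delete:  # simulate the second store failing
            self.face_store.pop(gov_id, None)

    def visible(self, raw_hits):
        """Drop hits for IDs that are no longer members of the registry."""
        return {g for g in raw_hits if g in self.members}
```

Here a leftover record in one store produces no match at all, rather than a spurious inconsistent repeat; the orphaned record can then be cleaned up later.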
Data store synchronization was an interesting challenge that in all honesty I had somewhat overlooked. I did not realize how severe an effect even a small degree of desynchronization would have on the usability of the inconsistent repeat check. With that said, maintaining synchronization in a robust way while keeping the method relatively simple was an interesting technical design exercise.
Balancing false positives and true positives
Because the goal of this project was fraud detection, it was crucial that the number of false positives was kept within tolerable levels. Fraud generally shows up in the long tail, which means that fraud checks are extremely sensitive to false positives and noise. With potential efficacy and processing time concerns in production, we decided to roll out the feature to customers who opted in for a beta test. From this beta test we observed that the false positive rate was too high for the inconsistent repeat ID check, mainly because:
- As mentioned before, whether it is due to technical reasons or user errors such as a poor image quality, OCR technology is not 100% reliable. As a result, OCR inaccuracies could be mistakenly interpreted as a sign of ID manipulation by the Government ID Registry.
- Facial recognition technology can be imprecise for a variety of reasons, including aging, real-life doppelgängers, and simply similar-looking people.
It was not feasible to adequately resolve any of these issues directly in the short term. Instead, we investigated tweaks that unfortunately reduce the breadth of true positives we can catch, but that reduce the number of false positives by an order of magnitude more. Fortunately, many fraud vectors that the Government ID Registry falls short on are covered by other tools in Persona's arsenal. In other words, despite some breadth reduction in fraud vectors, the Government ID Registry remains a solid addition to Persona's holistic approach to IDV.
This part of the project was not particularly difficult from a technical perspective, but theorizing about these potential tweaks had its own challenges. The process reminded me of the UDP network protocol and of how one decides whether UDP is usable for a certain type of application: when the reliability of an underlying system cannot be completely trusted, in what ways can it still be used practically? Though it's not a perfect analogy, I found balancing these product-centric trade-offs to be quite similar, and also a unique experience for me.
Performance considerations and optimizations
At Persona, we strive to keep our IDV processing times as low as possible. As a result, one of the requirements of this project was minimal additional latency. After several optimizations and refactors, including minimizing the number of nontrivial facial recognition operations and parallelizing work where possible, we achieved an additional latency of approximately two seconds at scale at the 90th percentile (P90). Although there is still room for improvement, this was reasonable for the additional capabilities that the Government ID Registry provides.
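The post does not say which operations were parallelized. As one plausible example, the textual query and the face query are independent of each other, so they could be issued concurrently. The sketch below shows that pattern with Python's `concurrent.futures`; `query_text` and `query_face` are hypothetical placeholders that only simulate I/O latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_text(gov_id: str) -> set:
    """Placeholder for the search-database query (simulated latency)."""
    time.sleep(0.1)
    return {"id_a", "id_b"}


def query_face(gov_id: str) -> set:
    """Placeholder for the face-similarity query (simulated latency)."""
    time.sleep(0.1)
    return {"id_b", "id_c"}


def query_both(gov_id: str) -> tuple:
    """Run both independent queries concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(query_text, gov_id)
        face_future = pool.submit(query_face, gov_id)
        return text_future.result(), face_future.result()
```

For I/O-bound calls like these, total latency approaches that of the slower query instead of the sum of the two.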
Out of all aspects of the project, this part reminded me the most of what you would learn in a CS algorithms class or would be thinking about when doing LeetCode. Although it was more than just a raw LeetCode question because of the additional context that needed to be considered, it was cool to see my educational background actually being relevant in a real life work scenario.
Project impact and final reflections
This entire project was quite a challenge, as I had never built something with so little groundwork laid out in a professional environment before. The experience came with many highs... but also some lows; in particular, when I discovered the atrociously poor accuracy for one of the use cases in the beta test. And when I say atrociously poor, I mean effectively 0%. Despite my seemingly failed creation, my team was supportive. They reminded me about the experimental nature of many of the engineering endeavors at Persona and that my experience was not one to feel sorry about. I eventually managed to improve the accuracy to a highly usable level with the balancing act mentioned in the previous section, so thankfully I could still be confident that I had a job after I graduated lol.
During my final university semester, I kept in touch with my mentor at Persona, who told me that the system I built was working well. Thank goodness, because I was still a little skeptical of it after I left! After returning full-time, I was debriefed on more details of the production audit results in addition to some customer feedback the team received about the impact of this project. All of the feedback was very positive, which gave me a sense of accomplishment. From the audit, we saw that our repeat ID detection was over 99% accurate, detecting more than 10,000 repeat ID submissions every week across the platform. In contrast to the disconcerting initial results of the inconsistent repeat detection check, some of our customers have now told us that it is one of their most effective means of deterring fraud.
I am grateful to have had the opportunity to drive such a complex project to a full rollout in an internship. It was a new experience that prompted a lot of professional growth for me from both an engineering and a product perspective. With that said, onto new adventures as a full-time member of the Persona team!