Appearance similarity learning for multi-person tracking and re-identification

  1. Gómez Silva, María José
Supervised by:
  1. José María Armingol Moreno (Supervisor)
  2. Arturo de la Escalera Hueso (Co-supervisor)

Defending university: Universidad Carlos III de Madrid

Date of defence: 10 December 2019

Examination board:
  1. María Araceli Sanchís de Miguel (Chair)
  2. Santiago Salamanca Miño (Secretary)
  3. Daniel Olmeda Reino (Member)

Type: Thesis

Abstract

In recent decades, the presence of video-surveillance sensors has grown continuously, boosted by an increasing awareness of the need to guarantee citizens’ security against emerging and already existing threats. A huge number of surveillance systems have been deployed in public and private infrastructures, such as military bases, airports, power plants, banks, or campuses. Traditional surveillance systems, where a human operator monitors several video streams, have become inefficient in light of the rapid growth in the number of installed sensors, the limited attention span of operators, and the high cost of labour. Therefore, research on Intelligent Surveillance Systems (ISSs) has been prompted by the necessity of automatically managing the huge amount of captured data to reduce costs and the response time to a dangerous or unauthorised event. ISS research activity has also been encouraged by European government policies through significant investment programmes, and it has powerful applications in the substantial and growing video-surveillance market. Research on computer vision and artificial intelligence is the leading force behind the development of such intelligent systems, with a wide range of monitoring functionalities, including the automatic detection and tracking of people to perform further analysis of their behaviour. The automation of the simultaneous tracking of multiple individuals, even across non-overlapping camera views, poses a challenging landmark whose achievement is essential for the surveillance endeavour in distributed networks of cooperative cameras. Due to the unavailability of fine biometric cues, and the difficulty of establishing temporal and spatial constraints in a network of non-overlapping and distant sensors, people tracking relies on visual data.
Besides that, the presence of frequent occlusions, interactions, and changing trajectories of individuals in unconstrained and crowded scenarios makes algorithms relying only on motion cues insufficient, and turns the individuals’ appearance into valuable information for recognising people. This thesis states and discusses the approach of addressing Multi-Object Tracking (MOT) and Re-Identification (Re-Id) of people through the learning of a Degree of Appearance Similarity (DoAS) measurement. Instead of individually modelling every person’s appearance, the proposed DoAS model treats the identification problem as a pair-wise binary classification task. It compares two images and determines whether they belong to the same person or not, so it is able to identify any unknown individual, which is an essential capability given the unpredictable nature of the surveillance task in real unstructured scenarios. The tracking of a person in a video sequence, especially after visual occlusions, and his/her re-identification across different camera views involve the recognition of his/her identity among all people detected in the scene. This is achieved by calculating the distances between the image rendering the query individual and the rest of the available detections coming from different frames or even different camera views. Subsequently, the correct match is the one presenting the smallest distance. However, the distances are not computed directly over the raw images but over a representative feature array. The Degree of Appearance Similarity between two images is defined as the probability that they render the same person, which is the complementary value of the distance between their features. Therefore, it is necessary to learn an embedding from an image into a feature space such that the distances between samples corresponding to the same person are smaller than those between samples of different people.
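The identification-by-distance idea described above can be sketched as follows. This is a minimal, hypothetical illustration with toy feature vectors (the thesis learns the embedding with a DCNN; here the features are given, and the normalisation of the distance into a [0, 1] similarity is an assumption for readability):

```python
import numpy as np

def doas(query_feat, gallery_feats):
    """Degree of Appearance Similarity as the complementary value of a
    normalised feature distance (illustrative sketch, not the thesis'
    exact formulation)."""
    d = np.linalg.norm(gallery_feats - query_feat, axis=1)
    d = d / (d.max() + 1e-12)      # normalise distances to [0, 1]
    return 1.0 - d                 # high DoAS = likely the same person

# toy 4-D features: gallery entry 1 lies closest to the query
query = np.array([1.0, 0.0, 0.0, 0.0])
gallery = np.array([[0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
scores = doas(query, gallery)
best = int(np.argmax(scores))      # the correct match: smallest distance
```

The correct match is simply the gallery detection with the highest DoAS, i.e. the smallest feature distance.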
This is a daunting challenge due to the low resolution of the images, the presence of people with similar appearance, and the viewing changes between camera pairs. The feature embedding has been modelled by a Deep Convolutional Neural Network (DCNN). Two learning models have been explored to train the DCNN: the Siamese and the Triplet model. The Siamese architecture duplicates the DCNN in two identical branches sharing parameters and joined in the last layer, where the loss function performs pairwise verification. This model is fed with pairs of images: positive pairs are formed by images corresponding to the same person, and negative pairs by images of different people. Analogously, the Triplet model uses three branches to compare a positive and a negative pair of images. The input is a triplet of images: one is the anchor or reference image, which is coupled with two images, one belonging to the same person and the other to a different individual. Therefore, pairs and triplets of person images have been used to feed a contrastive architecture, so that the DCNN automatically learns discriminative and invariant appearance features. The capacity of the DoAS metric to identify a person by comparing a pair of images has been exploited to perform both Single-Shot Re-Identification (Re-Id) and online Multi-Object Tracking (MOT). Single-Shot Re-Id consists of the recognition of a person’s identity across non-overlapping camera views from only a pair of snapshots (one per camera), in order to reduce the quantity of data transmitted among cooperative sensors. The re-identification of a person by an appearance neural model raises significant barriers due to its intrinsically unbalanced nature, given the lack of data about the people to identify and the huge number of possible false assignments with other detected agents. This can result in over-fitting and the collapse of the deep neural models used to render the visual appearance.
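The two loss functions behind the Siamese and Triplet models can be sketched over feature vectors. This is an illustrative implementation of the standard contrastive and triplet losses (the margins and exact variants used in the thesis may differ):

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Pairwise verification loss of the Siamese model: positive pairs
    are pulled together, negative pairs are pushed beyond the margin."""
    d = np.linalg.norm(f1 - f2)
    return d ** 2 if same else max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Relative loss of the Triplet model: the anchor must lie closer
    to the positive sample than to the negative one by the margin."""
    dp = np.linalg.norm(anchor - positive)
    dn = np.linalg.norm(anchor - negative)
    return max(0.0, dp - dn + margin)

a = np.array([0.0, 0.0])
p = np.array([0.0, 0.1])   # same person, close in feature space
n = np.array([1.0, 0.0])   # different person, far away
loss = triplet_loss(a, p, n)   # satisfied triplet: zero loss
```

Note the key difference stated in the abstract: the triplet loss only constrains a relative distance, whereas the contrastive loss imposes absolute targets on each pair.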
In order to increase the discriminative power of the DoAS model and to diminish the challenges posed by the Single-Shot Re-Id task, a collection of learning techniques has been developed and analysed. Contributions have been made in multiple parts of the re-id learning pipeline, from the generation and augmentation of the training data to the design of new connection and loss functions. The developed methods palliate the shortage of re-id data while avoiding extra data acquisition and labelling costs. For instance, the discriminative capacity of the re-id model has been increased through the design of a new Normalised Double Margin-based Contrastive Loss function. Furthermore, the triplet model learning proposed in [2] has been adapted to the single-shot re-identification task by the formulation of a mini-batch triplet-based gradient descent algorithm. The available data is not enough to adopt an identity-centred approach, which clusters the samples of the same person close to each other and distant from the clusters of other identities. On the contrary, the proposed method treats all the possible positive pairs as a set rendering the condition of similarity, while the negative pairs represent the dissimilarity situation. The proposed mini-batch-based learning algorithm enables the learning of these two classes, similarity and dissimilarity, from a vast quantity of samples of both. Moreover, this thesis proposes an innovative idea: transferring the learning previously acquired in the MOT domain to the Re-Id model. The intuition behind this approach is that the most representative features of a person are automatically learnt on a MOT dataset. From the set of learnt descriptors, the low-level ones, which are learnt in the earlier layers of the network model, are kept. Then, the most high-level representations, coded in the deeper layers, are fine-tuned on a re-id target dataset to make them more discriminative.
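A double-margin contrastive loss of the kind named above could look like the following. This is a hypothetical sketch, not the thesis' exact formulation: it assumes the input is a distance already normalised to [0, 1], and uses two illustrative margins `m_pos` and `m_neg` so that easy pairs on the correct side of their margin contribute no gradient:

```python
def double_margin_contrastive(d, same, m_pos=0.3, m_neg=0.7):
    """Hypothetical double-margin contrastive loss over a normalised
    distance d in [0, 1]: positive pairs are penalised only when d
    exceeds m_pos, negative pairs only when d falls below m_neg.
    Margin values here are illustrative assumptions."""
    if same:
        return max(0.0, d - m_pos) ** 2   # positive pair too far apart
    return max(0.0, m_neg - d) ** 2       # negative pair too close

easy_pos = double_margin_contrastive(0.2, same=True)    # inside margin
easy_neg = double_margin_contrastive(0.9, same=False)   # beyond margin
hard_pos = double_margin_contrastive(0.5, same=True)    # penalised
```

The gap between the two margins leaves a dead zone in which neither class is pushed, which is one common motivation for double-margin formulations.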
The proposed transfer learning method presents the following benefits with respect to transferring learning from other re-id datasets or from classification models. Firstly, the re-id network architecture does not require any modification to transfer learning from people tracking to re-identification, since the resolution and aspect ratio of the person representations in both domains are quite similar. Secondly, a data generation tool has been designed to create pairs of images from MOT sequences, which allows obtaining a larger training dataset. Finally, samples from different MOT sequences have been coupled, which helps to avoid the dependence on the characteristics of a certain camera view, and consequently to avoid negative transfer. Besides, a novel method called “Triplets Permutation” has been proposed. This method formulates different modes of combining the person images of a re-identification dataset to generate triplets. This increases the variety of triplets obtained from a certain dataset, alleviating the problem of insufficient data and, consequently, model over-fitting. Moreover, new neural layers have been designed to perform online data augmentation, so that the creation of new samples is integrated into the learning process: no previous offline stage is required to create new data, nor memory space to save it. Not only learning techniques but also neural architectures have been developed, based on different partitions of the human shape, to integrate spatial information into the learnt features, to make them robust against pose and background variations, and to deal with the misalignment problem of the re-identification task. Moreover, a comparative study of different existing network architectures has been performed, and the effects of shortening the models have been analysed, concluding that the simpler the model, the less prone it is to over-fitting.
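One simple mode of combining the images of a labelled dataset into triplets, in the spirit of the “Triplets Permutation” idea, can be sketched as follows. This is an illustrative assumption about how triplets might be enumerated (the thesis defines several such modes, which may differ from this one):

```python
import itertools

def make_triplets(samples):
    """Illustrative triplet generation: `samples` maps a person id to
    its list of image ids. Every ordered (anchor, positive) permutation
    of an identity is combined with every image of every other
    identity, enlarging the pool of training triplets."""
    triplets = []
    for pid, imgs in samples.items():
        for a, p in itertools.permutations(imgs, 2):   # ordered pairs
            for other_pid, other_imgs in samples.items():
                if other_pid == pid:
                    continue                           # negatives only
                for n in other_imgs:
                    triplets.append((a, p, n))
    return triplets

# hypothetical toy dataset: two identities, three images in total
data = {"id1": ["a1", "a2"], "id2": ["b1"]}
trips = make_triplets(data)
```

Because permutations are ordered, each positive pair is used twice (with either image as the anchor), which already multiplies the number of distinct triplets obtainable from the same data.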
As far as the distance metric is concerned, the Mahalanobis distance has been proposed when re-identification is performed from two fixed camera views. Contrary to previous works, where the Mahalanobis matrix is estimated from a set of previously computed features, this thesis proposes the integration of its estimation into the feature learning process, influencing the evolution of the learnt features and improving it simultaneously. The treatment of the Mahalanobis matrix elements as the weights of a neural layer brings a novel solution to the estimation of the camera-to-camera transformation to deal with viewing variations.

A development framework has been designed and implemented to train, validate, and test all the designed methods, layers, and architectures. It consists of a set of tools with different purposes, including dataset generation, the interpretation of learnt models in different programming languages, and the evaluation of neural models by measuring several metrics. Most of these tools have been publicly released, since their modularity makes them suitable for many different learning applications. Besides that, the research and experiments conducted along this dissertation on the application of deep learning techniques to model the DoAS for Single-Shot Re-Identification have demonstrated several hypotheses, which have been disseminated through multiple publications:

• In single-shot Re-Id, there are only two instances of a certain individual against a high number of different people. Due to this unbalanced nature, a certain compensation in the sizes of the training sets of positive and negative pairs is necessary to prevent the learning process from ignoring the positive pairs’ contributions and collapsing. However, a model trained on a partially balanced training set performs better than one trained on an absolutely balanced set, since the latter differs too much from the target task, where the model has to deal with unbalanced data.

• The Triplet model is more effective than the Siamese one for training a re-identification model, since the triplet loss relies on a relative distance that makes it more flexible against the variations among images captured from different camera views. However, in mono-camera multi-object tracking, the variations among the images of the same person, even after temporary disappearances, are not as wide as in re-identification. Therefore, in the MOT domain, the Siamese model can discriminate the pairs of samples into two well-differentiated groups, positive and negative pairs, better than the Triplet model.

• The application of some techniques, such as data augmentation and body-shape partition, does not always increase the performance of the re-id model. In general, these strategies enhance the results, because data augmentation enlarges the variety of training samples and the body-shape partition reduces the dependence on the pose and background, so the model focuses on dealing with the variations among different camera views. However, this is not the case when the Re-Id model is meant to re-identify people from two fixed camera views, where the variations between cameras can be modelled by the network. In that case, the subtraction of the background and the introduction of a wide variety of training samples disturb such modelling.

• In general, the training of a model on larger datasets results in an enhancement of its performance, as long as the training data do not differ too much from the target test data.
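The earlier idea of treating the Mahalanobis matrix as the weights of a neural layer can be sketched as follows. One common way to keep the matrix positive semi-definite is to parameterise it as M = WᵀW and let gradient descent update W; that factorisation is an assumption here, not necessarily the thesis' exact parameterisation:

```python
import numpy as np

def mahalanobis_layer(f1, f2, W):
    """Distance layer whose learnable weights W define the Mahalanobis
    matrix M = W^T W (factorised so M stays positive semi-definite).
    Illustrative sketch of treating metric parameters as layer weights."""
    diff = f1 - f2
    M = W.T @ W
    return float(np.sqrt(diff @ M @ diff))

# with W = identity, the layer reduces to the Euclidean distance
f1 = np.array([1.0, 2.0])
f2 = np.array([4.0, 6.0])
d = mahalanobis_layer(f1, f2, np.eye(2))
```

Training W jointly with the embedding is what lets the metric absorb the camera-to-camera transformation between two fixed views.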
In conclusion, the results have demonstrated the potential of deep learning to solve the challenging task of Single-Shot Re-Identification, as long as the lack of Re-Id data is faced with a proper strategy. Furthermore, the properties of the DoAS measurement also make it suitable for measuring the score of assigning a certain identity to a detected person in the process of associating individuals’ observations across different frames in a tracking algorithm. This thesis proposes a Multi-Person Tracking algorithm whose core is an online Cascade Data Association method that performs frame-to-frame identity assignment. The association process has been divided into different consecutive levels. Through this hierarchical association process, a specialised solution is provided for different critical tracking situations, such as the intersection of agents, their occlusion, disappearances, and the birth of new tracks. This strategy avoids the problem of merged and split observations caused by crossing agents. The data association at every level is mainly performed by a multiple-global-hypotheses generator. Every hypothesis describes a probable assignment between a new set of detections and their identities. This global approach prevents possible identity switches, since each agent assignment is conditioned by the rest of the associations in a frame. Instead of using pruning methods to filter the generated hypotheses down to the probable ones, the generation method is implicitly conditioned by a previously calculated mask of impossible associations. In that way, only the notably probable hypotheses are generated and proposed, reducing the computational time, and the birth of new tracks is also considered in these hypotheses. Furthermore, a novel formulation has been designed to measure the cost of matching a certain identity with a detection.
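The masked global-hypothesis generation can be illustrated with a small brute-force enumeration. This sketch assumes a square cost matrix (one detection per identity) and uses exhaustive permutation, whereas the thesis' generator is hierarchical and also handles track births; the mask of impossible associations plays the same role in both:

```python
import itertools
import numpy as np

def generate_hypotheses(cost, mask):
    """Enumerate one-to-one assignments of detections (columns) to
    identities (rows), skipping any hypothesis that contains an
    association forbidden by the pre-computed mask, and rank the
    surviving hypotheses by total cost (illustrative sketch)."""
    n = cost.shape[0]
    hyps = []
    for perm in itertools.permutations(range(n)):
        if any(mask[i, j] for i, j in enumerate(perm)):
            continue   # hypothesis contains an impossible association
        total = sum(cost[i, j] for i, j in enumerate(perm))
        hyps.append((total, perm))
    return sorted(hyps)

cost = np.array([[1.0, 9.0],
                 [9.0, 1.0]])
mask = np.array([[False, True],    # identity 0 cannot be detection 1
                 [False, False]])
best_cost, best_assign = generate_hypotheses(cost, mask)[0]
```

Because forbidden combinations are never generated, the hypothesis set stays small without a separate pruning pass, which is the point made in the abstract.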
Trackers based only on the targets’ dynamics are not able to handle rapid motion changes or agents with varying trajectories. On the other hand, relying only on appearance models can be problematic when the scene is highly crowded or when individuals present a similar appearance. For that reason, this thesis defines a multi-modal cost function based on the adaptive weighting of position and visual appearance cues. The cost computation can take different formulations according to the presence of ambiguities due to crowds, crossing agents, occlusions, and missed or redundant detections. The appearance similarity between different observations is given by the DoAS model. Instead of learning specific patterns for each one of the tracked agents, a unique model has been pre-trained to identify the appearance similarity between different images of the same person, allowing the tracking of multiple people with the same model. In that way, this method does not require previous knowledge about the scene, and it presents the same accuracy from the first frame of the tracking. In addition, temporal consistency has been added to the DoAS model by training a Long Short-Term Memory (LSTM) cell, which compares every detection not only with the last saved observations but also with those from previous frames. This approach allows a frame-to-frame assignment without depending on future observations, as batch algorithms do. This model has been trained to deal with failed associations in previous frames, preventing the further propagation of the error. A varied set of tracklets (fragments of tracks) has been specially generated for that purpose, by deliberately introducing temporal gaps between some people’s detections, as well as intruders’ detections. In that way, the model has been trained to face complex real surveillance situations.
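A minimal form of the multi-modal cost described above could be a convex combination of the two cues. The linear form and the parameter `alpha` are illustrative assumptions; the thesis adapts the formulation itself to the scene's ambiguities rather than just a scalar weight:

```python
def association_cost(pos_dist, doas, alpha):
    """Hypothetical weighting of motion and appearance cues: alpha
    balances the (normalised) positional distance against the
    appearance term (1 - DoAS). In the full algorithm the weighting
    would adapt to crowds, crossings, and occlusions."""
    return alpha * pos_dist + (1.0 - alpha) * (1.0 - doas)

# in a crowded scene, appearance should dominate (small alpha)...
crowded = association_cost(pos_dist=0.2, doas=0.9, alpha=0.2)
# ...while with well-separated agents, position can dominate
sparse = association_cost(pos_dist=0.2, doas=0.9, alpha=0.8)
```

With a high DoAS (likely the same person), the crowded-scene weighting yields the lower cost, so the appearance-confirmed match wins exactly when motion cues are least reliable.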
Furthermore, to face the errors inherent to the detector, which cause missed and false-positive agents, two strategies based on the geometrical analysis of human skeletons have been designed. Firstly, a filter to validate the human shape of the detected objects, which geometrically analyses the structure described by the joints detected by a Convolutional Pose Machine (CPM) [8]. Secondly, the inference of the location of missed agents has been performed by a detection-by-tracking approach that uses the CPM to search for a human shape in the location where each missed agent is expected to be found. The developed MOT algorithm is based on dedicated structures to manage all the employed tracking cues, as well as their identity assignments and state updates. These representations serve a multi-agent management strategy, which is able to generate newly appearing tracks, update the already existing identities, and re-identify temporarily disappeared agents. The designed association method is versatile and robust against multiple situations and scenarios. It has been evaluated over sequences presenting a wide variety and number of people, in outdoor and indoor scenarios, and from fixed and mobile cameras with different perspectives. For all the sequences the setting of the hyper-parameters of the tracking algorithm was the same, so the versatility of a universal algorithm has been proved. Furthermore, the proposed system is modular and scalable. Hence, the developed methods are applicable to different problems, including face recognition or tracking for search-and-rescue purposes, as has also been proved. Finally, the experimental results demonstrate that deeply learnt features provide greater discriminative power than low-level hand-crafted features, and prove the potential of deep learning to solve the challenging task of measuring the DoAS between two images, as long as the lack of training data is faced with a proper strategy.
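A skeleton-based validation filter of the first kind could be sketched as a few geometric checks over the joints returned by a pose estimator. Everything here is a hypothetical simplification: the joint names, thresholds, and the two checks are illustrative assumptions, while the thesis' CPM-based analysis is more elaborate:

```python
def valid_human_shape(joints, min_joints=3):
    """Hypothetical geometric filter over pose-estimator joints
    (name -> (x, y) in image coordinates, y growing downwards):
    require enough detected joints and the head above the ankles.
    Illustrative only; the thesis' CPM-based checks differ."""
    detected = [p for p in joints.values() if p is not None]
    if len(detected) < min_joints:
        return False               # too few joints: likely not a person
    head, ankle = joints.get("head"), joints.get("ankle")
    if head and ankle and head[1] >= ankle[1]:
        return False               # head below the feet: implausible
    return True

upright = {"head": (50, 10), "neck": (50, 40), "ankle": (50, 190)}
flipped = {"head": (50, 190), "neck": (50, 100), "ankle": (50, 10)}
```

A detection that fails such checks would be discarded as a false positive before entering the association stage.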
Therefore, the DoAS model has proved its ability to perform the surveillance tasks of Multi-Object Tracking and Person Re-identification, which is the main statement of this dissertation. In addition, the developed learning techniques and methods, which are analysed in this thesis, and the generated knowledge provide a valuable contribution to the field of deep learning and surveillance algorithms research.

References:
[2] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
[8] Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4724-4732).